We must move this initialization from xprt_start() callback, which
comes too late (after handshake completion for 1RTT session). This timer must be
usable as soon as we have packets to send/receive. Let's initialize it after
the TLS context is initialized in qc_conn_alloc_ssl_ctx(). This latter function
initializes I/O handler task (quic_conn_io_cb) to send/receive packets.
We add a new flag to mark a connection as already enqueued for acception.
This is useful for 0-RTT session where a connection is first enqueued for
acception as soon as 0-RTT RX secrets could be derived. Then as for any other
connection, we could accept one more time this connection after handshake
completion which lead to very bad side effects.
Thank you to Amaury for this nice patch.
mworker_cli_sockpair_new() is used to create the socketpair CLI listener of
the worker. Its FD is referenced in the mworker_proc structure, however,
once it's assigned to the listener the reference should be removed so we
don't use it accidentally.
The same must be done in case of errors if the FDs were already closed.
When starting HAProxy in master-worker, the master pre-allocate a struct
mworker_proc and do a socketpair() before the configuration parsing. If
the configuration loading failed, the FD are never closed because they
aren't part of listener, they are not even in the fdtab.
This patch fixes the issue by cleaning the mworker_proc structure that
were not asssigned a process, and closing its FDs.
Must be backported as far as 2.0, the srv_drop() only frees the memory
and could be dropped since it's done before an exec().
There were a few casts of list* to mt_list* that were upsetting some
old compilers (not sure about the effect on others). We had created
list_to_mt_list() purposely for this, let's use it instead of applying
this cast.
dladdr1() is used on glibc and takes a void**, but we pass it a
const ElfW(Sym)** and some compilers complain that we're aliasing.
Let's just set a may_alias attribute on the local variable to
address this. There's no need to backport this unless warnings are
reported on older distros or uncommon compilers.
At a few places in the code the switch/case ond flags are tested against
64-bit constants without explicitly being marked as long long. Some
32-bit compilers complain that the constant is too large for a long, and
other likely always use long long there. Better fix that as it's uncertain
what others which do not complain do. It may be backported to avoid doubts
on uncommon platforms if needed, as it touches very few areas.
fcgi_trace() declares fconn as a const and casts its mbuf array to
(struct buffer*), which rightfully upsets some older compilers. Better
just declare it as a writable variable and get rid of the cast. It's
harmless anyway. This has been there since 2.1 with commit 5c0f859c2
("MINOR: mux-fcgi/trace: Register a new trace source with its events")
and doens't need to be backported though it would not harm either.
Once in a while we get rid of this one. isblank() is missing on old
C libraries and only matches two values, so let's just replace it.
It was brought with this commit in 2.4:
0bf268e18 ("MINOR: server: Be more strict on the server-state line parsing")
It may be backported though it's really not important.
Compiling vars.c with gcc 4.2 shows that we're initializing some local
structs field members in a not really portable way:
src/vars.c: In function 'vars_parse_cli_set_var':
src/vars.c:1195: warning: initialized field overwritten
src/vars.c:1195: warning: (near initialization for 'px.conf.args')
src/vars.c:1195: warning: initialized field overwritten
src/vars.c:1195: warning: (near initialization for 'px.conf')
src/vars.c:1201: warning: initialized field overwritten
src/vars.c:1201: warning: (near initialization for 'rule.conf')
It's totally harmless anyway, but better clean this up.
These functions are declared as external functions in check.h and
as inline functions in check.c. Let's move them as static inline in
check.h. This appeared in 2.4 with the following commits:
4858fb2e1 ("MEDIUM: check: align agentaddr and agentport behaviour")
1c921cd74 ("BUG/MINOR: check: consitent way to set agentaddr")
While harmless (it only triggers build warnings with some gcc 4.x),
it should probably be backported where the paches above are present
to keep the code consistent.
The man page indicates that CPU_AND() and CPU_ASSIGN() take a variable,
not a const on the source, even though it doesn't make much sense. But
with older libcs, this triggers a build warning:
src/cpuset.c: In function 'ha_cpuset_and':
src/cpuset.c:53: warning: initialization discards qualifiers from pointer target type
src/cpuset.c: In function 'ha_cpuset_assign':
src/cpuset.c:101: warning: initialization discards qualifiers from pointer target type
Better stick stricter to the documented API as this is really harmless
here. There's no need to backport it (unless build issues are reported,
which is quite unlikely).
When the master process is reloaded on a new config, it will try to
connect to the previous process' socket to retrieve all known
listening FDs to be reused by the new listeners. If listeners were
removed, their unused FDs are simply closed.
However there's a catch. In case a socket fails to bind, the master
will cancel its startup and swithc to wait mode for a new operation
to happen. In this case it didn't close the possibly remaining FDs
that were left unused.
It is very hard to hit this case, but it can happen during a
troubleshooting session with fat fingers. For example, let's say
a config runs like this:
frontend ftp
bind 1.2.3.4:20000-29999
The admin wants to extend the port range down to 10000-29999 and
by mistake ends up with:
frontend ftp
bind 1.2.3.41:20000-29999
Upon restart the bind will fail if the address is not present, and the
master will then switch to wait mode without releasing the previous FDs
for 1.2.3.4:20000-29999 since they're now apparently unused. Then once
the admin fixes the config and does:
frontend ftp
bind 1.2.3.4:10000-29999
The service will start, but will bind new sockets, half of them
overlapping with the previous ones that were not properly closed. This
may result in a startup error (if SO_REUSEPORT is not enabled or not
available), in a FD number exhaustion (if the error is repeated many
times), or in connections being randomly accepted by the process if
they sometimes land on the old FD that nobody listens on.
This patch will need to be backported as far as 1.8, and depends on
previous patch:
MINOR: sock: move the unused socket cleaning code into its own function
Note that before 2.3 most of the code was located inside haproxy.c, so
the patch above should probably relocate the function there instead of
sock.c.
The startup code used to scan the list of unused sockets retrieved from
an older process, and to close them one by one. This also required that
the knowledge of the internal storage of these temporary sockets was
known from outside sock.c and that the code was copy-pasted at every
call place.
This patch moves this into sock.c under the name
sock_drop_unused_old_sockets(), and removes the xfer_sock_list
definition from sock.h since the rest of the code doesn't need to know
this.
This cleanup is minimal and preliminary to a future fix that will need
to be backported to all versions featuring FD transfers over the CLI.
In the release callback, ctx.peers was used instead of ctx.sft. Concretly,
it is not an issue because the appctx context is an union and these both
fields are structures with a unique pointer. But it will be a problem if
that changes.
This patch must be backported as far as 2.2.
When a string is converted to a domain name label, the trailing dot must be
ignored. In resolv_str_to_dn_label(), there is a test to do so. However, the
trailing dot is not really ignored. The character itself is not copied but
the string index is still moved to the next char. Thus, this trailing dot is
counted in the length of the last encoded part of the domain name. Worst,
because the copy is skipped, a garbage character is included in the domain
name.
This patch should fix the issue #1528. It must be backported as far as 2.0.
Do not use an extra DCID parameter on new_quic_cid to be able to
associated a new generated CID to a thread ID. Simply do the computation
inside the function. The API is cleaner this way.
This also has the effects to improve the apparent randomness of CIDs.
With the previous version the first byte of all CIDs are identical for a
connection which could lead to privacy issue. This version may not be
totally perfect on this aspect but it improves the situation.
The producer must know where is the tailing hole in the RX buffer
when it purges it from consumed datagram. This is done allocating
a fake datagram with the remaining number of bytes which cannot be produced
at the tail of the RX buffer as length.
The CID trees are no more attached to the listener receiver but to the
underlying datagram handlers (one by thread) which run always on the same thread.
So, any operation on these trees do not require any locking.
We copy the first octet of the original destination connection ID to any CID for
the connection calling new_quic_cid(). So this patch modifies only this function
to take a dcid as passed parameter.
As the RX buffer is not consumed by the sock i/o handler as soon as a datagram
is produced, when full an RX buffer must not be reset. The remaining room is
consumed without modifying it. The consumer has a represention of its contents:
a list of datagrams.
Rename quic_lstnr_dgram_read() to quic_lstnr_dgram_dispatch() to reflect its new role.
After calling this latter, the sock i/o handler must consume the buffer only if
the datagram it received is detected as wrong by quic_lstnr_dgram_dispatch().
The datagram handler task mark the datagram as consumed atomically setting ->buf
to NULL value. The sock i/o handler is responsible of flushing its RX buffer
before using it. It also keeps a datagram among the consumed ones so that
to pass it to quic_lstnr_dgram_dispatch() and prevent it from allocating a new one.
As mentionned in the comment, the tx_qrings and rxbufs members of
receiver struct must be pointers to pointers!
Modify the functions responsible of their allocations consequently.
Note that this code could work because sizeof rxbuf and sizeof tx_qrings
are greater than the size of pointer!
quic_dgram_read() parses all the QUIC packets from a UDP datagram. It is the best
candidate to be converted into a task, because is processing data unit is the UDP
datagram received by the QUIC sock i/o handler. If correct, this datagram is
added to the context of a task, quic_lstnr_dghdlr(), a conversion of quic_dgram_read()
into a task. This task pop a datagram from an mt_list and passes it among to
the packet handler (quic_lstnr_pkt_rcv()).
Modify the quic_dgram struct to play the role of the old quic_dgram_ctx struct when
passed to quic_lstnr_pkt_rcv().
Modify the datagram handlers allocation to set their tasks to quic_lstnr_dghdlr().
Add quic_dghdlr new struct do define datagram handler tasks, one by thread.
Allocate them and attach them to the listener receiver part calling
quic_alloc_dghdlrs_listener() newly implemented function.
Add quic_dgram new structure to store information about datagrams received
by the sock I/O handler (quic_sock_fd_iocb) and its associated pool.
Implement quic_get_dgram_dcid() to retrieve the datagram DCID which must
be the same for all the packets in the datagram.
Modify quic_lstnr_dgram_read() called by the sock I/O handler to allocate
a quic_dgram each time a correct datagram is found and add it to the sock I/O
handler rxbuf dgram list.
This function is no more used anymore, broken and uses code shared with the
listener packet parser. This is becoming anoying to continue to modify
it without testing each time we modify the code it shares with the
listener packet parser.
This is to be sure xprt functions do not manipulate the buffer struct
passed as parameter to quic_lstnr_dgram_read() from low level datagram
I/O callback in quic_sock.c (quic_sock_fd_iocb()).
Mention that the token is sent only by servers in both server and listener
packet parsers.
Remove a "TO DO" section in listener packet parser because there is nothing
more to do in this function about the token
This quic_dgram_ctx struct member is used to denote if we are parsing a new
datagram (null value), or a coalesced packet into the current datagram (non null
value). But it was never set.
There's a risk that fdtab is not 64-byte aligned. The first effect is that
it may cause false sharing between cache lines resulting in contention
when adjacent FDs are used by different threads. The second is related
to what is explained in commit "BUG/MAJOR: compiler: relax alignment
constraints on certain structures", i.e. that modern compilers might
make use of aligned vector operations to zero some entries, and would
crash. We do not use any memset() or so on fdtab, so the risk is almost
inexistent, but that's not a reason for violating some valid assumptions.
This patch addresses this by allocating 64 extra bytes and aligning the
structure manually (this is an extremely cheap solution for this specific
case). The original address is stored in a new variable "fdtab_addr" and
is the one that gets freed. This remains extremely simple and should be
easily backportable. A dedicated aligned allocator later would help, of
course.
This needs to be backported as far as 2.2. No issue related to this was
reported yet, but it could very well happen as compilers evolve. In
addition this should preserve high performance across restarts (i.e.
no more dependency on allocator's alignment).
The standalone testing code used to rely on rand(), but switching to a
xorshift generator speeds up the test by 7% which is important to
accurately measure the real impact of the LRU code itself.
Do not proceed to direct accept when creating a new quic_conn. Wait for
the QUIC handshake to succeeds to insert the quic_conn in the accept
queue. A tasklet is then woken up to call listener_accept to accept the
quic_conn.
The most important effect is that the connection/mux layers are not
instantiated at the same time as the quic_conn. This forces to delay
some process to be sure that the mux is allocated :
* initialization of mux transport parameters
* installation of the app-ops
Also, the mux instance is not checked now to wake up the quic_conn
tasklet. This is safe because the xprt-quic code is now ready to handle
the absence of the connection/mux layers.
Note that this commit has a deep impact as it changes significantly the
lower QUIC architecture. Most notably, it breaks the 0-RTT feature.
Create a new structure li_per_thread. This is uses as an array in the
listener structure, with an entry allocated per thread. The new function
li_init_per_thr is responsible of the allocation.
For now, li_per_thread contains fields only useful for QUIC listeners.
As such, it is only allocated for QUIC listeners.
Create a new type quic_accept_queue to handle QUIC connections accept.
A queue will be allocated for each thread. It contains a list of
listeners which contains at least one quic_conn ready to be accepted and
the tasklet to run listener_accept for these listeners.
Mark QUIC listeners with the flag LI_F_QUIC_LISTENER. It is set by the
proto-quic layer on the add listener callback. This allows to override
more clearly the accept callback on quic_session_accept.
The connection is allocated after finishing the QUIC handshake. Remove
handshake/L6 flags when initializing the connection as handshake is
finished with success at this stage.
Remove usage of connection in quic_conn_from_buf. As connection and
quic_conn are decorrelated, it is not logical to check connection flags
when using sendto.
This require to store the L4 peer address in quic_conn to be able to use
sendto.
This change is required to delay allocation of connection.
QUIC connections are distributed accross threads by xprt-quic according
to their CIDs. As such disable the thread selection in listener_accept
for QUIC listeners.
This prevents connection from migrating to another threads after its
allocation which can results in unexpected side-effects.
This flag is named RX_F_LOCAL_ACCEPT. It will be activated for special
receivers where connection balancing to threads is already handle
outside of listener_accept, such as with QUIC listeners.
Add a new function in mux-quic to install app-ops. For now this
functions is called during the ALPN negotiation of the QUIC handshake.
This change will be useful when the connection accept queue will be
implemented. It will be thus required to delay the app-ops
initialization because the mux won't be allocated anymore during the
QUIC handshake.
Define a new enum to represent the status of the mux/connection layer
above a quic_conn. This is important to know if it's possible to handle
application data, or if it should be buffered or dropped.
Adjust the function to check if header protection can be removed. It can
now be used both for a single packet in qc_lstnr_pkt_rcv and in the
quic_conn handler to handle buffered packets for a specific encryption
level.
When squashing commit add43fa43 ("DEBUG: pools: add new build option
DEBUG_POOL_TRACING") I managed to break the build and to fail to detect
it even after the rebase and a full rebuild :-(
David Carlier reported a build breakage on Haiku since commit
5be7c198e ("DEBUG: cli: add a new "debug dev fd" expert command")
due to O_ASYNC not being defined. Ilya also reported it broke the
build on Cygwin. It's not that portable and sometimes defined as
O_NONBLOCK for portability. But here we don't even need that, as
we already condition other flags, let's just ignore it if it does
not exist.
The poller's pipe was only registered on the read side since we don't
need to poll to write on it. But this leaves some known FDs so it's
better to also register the write side with no event. This will allow
to show them in "show fd" and to avoid dumping them as unhandled FDs.
Note that the only other type of unhandled FDs left are:
- stdin/stdout/stderr
- epoll FDs
The later can be registered upon startup though but at least a dummy
handler would be needed to keep the fdtab clean.
This command will scan the whole file descriptors space to look for
existing FDs that are unknown to haproxy's fdtab, and will try to dump
a maximum number of information about them (including type, mode, device,
size, uid/gid, cloexec, O_* flags, socket types and addresses when
relevant). The goal is to help detecting inherited FDs from parent
processes as well as potential leaks.
Some of those listed are actually known but handled so deep into some
systems that they're not in the fdtab (such as epoll FDs or inter-
thread pipes). This might be refined in the future so that these ones
become known and do not appear.
Example of output:
$ socat - /tmp/sock1 <<< "expert-mode on;debug dev fd"
0 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
1 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
2 type=tty. mod=0620 dev=0x8803 siz=0 uid=1000 gid=5 fs=0x16 ino=0x6 getfd=+0 getfl=O_RDONLY,O_APPEND
3 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x18112348 getfd=+0
4 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
33 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8251 getfd=+0 getfl=O_RDONLY
34 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
36 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8d1b getfd=+0 getfl=O_RDONLY
37 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
39 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24afa04f getfd=+0 getfl=O_RDONLY
41 type=pipe mod=0600 dev=0 siz=0 uid=1000 gid=100 fs=0xc ino=0x24af8252 getfd=+0 getfl=O_RDONLY
42 type=epol mod=0600 dev=0 siz=0 uid=0 gid=0 fs=0xd ino=0x3674 getfd=+0 getfl=O_RDONLY
This new option, when set, will cause the callers of pool_alloc() and
pool_free() to be recorded into an extra area in the pool that is expected
to be helpful for later inspection (e.g. in core dumps). For example it
may help figure that an object was released to a pool with some sub-fields
not yet released or that a use-after-free happened after releasing it,
with an immediate indication about the exact line of code that released
it (possibly an error path).
This only works with the per-thread cache, and even objects refilled from
the shared pool directly into the thread-local cache will have a NULL
there. That's not an issue since these objects have not yet been freed.
It's worth noting that pool_alloc_nocache() continues not to set any
caller pointer (e.g. when the cache is empty) because that would require
a possibly undesirable API change.
The extra cost is minimal (one pointer per object) and this completes
well with DEBUG_POOL_INTEGRITY.
This adds a caller to pool_put_to_cache() and pool_get_from_cache()
which will optionally be used to pass a pointer to their callers. For
now it's not used, only the API is extended to support this pointer.
The pool_alloc() function was already a wrapper to __pool_alloc() which
was also inlined but took a set of flags. This latter was uninlined and
moved to pool.c, and pool_alloc()/pool_zalloc() turned to macros so that
they can more easily evolve to support debugging options.
The number of call places made this code grow over time and doing only
this change saved ~1% of the whole executable's size.
The pool_free() function has become a bit big over time due to the
extra consistency checks. It used to remain inline only to deal
cleanly with the NULL pointer free that's quite present on some
structures (e.g. in stream_free()).
Here we're splitting the function in two:
- __pool_free() does the inner block without the pointer test and
becomes a function ;
- pool_free() is now a macro that only checks the pointer and calls
__pool_free() if needed.
The use of a macro versus an inline function is only motivated by an
easier intrumentation of the code later.
With this change, the code size reduces by ~1%, which means that at
this point all pool_free() call places used to represent more than
1% of the total code size.
Fix potential null pointer dereference. In fact, this case is not
possible, only a mistake in SSL ex-data initialization may cause it :
either connection is set or quic_conn, which allows to retrieve
the bind_conf.
A BUG_ON was already present but this does not cover release build.
Extract the allocation of ssl_sock_ctx from qc_conn_init to a dedicated
function qc_conn_alloc_ssl_ctx. This function is called just after
allocating a new quic_conn, without waiting for the initialization of
the connection. It allocates the ssl_sock_ctx and the quic_conn tasklet.
This change is now possible because the SSL callbacks are dealing with a
quic_conn instance.
This change is required to be able to delay the connection allocation
and handle handshake packets without it.
Allow to register quic_conn as ex-data in SSL callbacks. A new index is
used to identify it as ssl_qc_app_data_index.
Replace connection by quic_conn as SSL ex-data when initializing the QUIC
SSL session. When using SSL callbacks in QUIC context, the connection is
now NULL. Used quic_conn instead to retrieve the required parameters.
Also clean up
The same changes are conducted inside the QUIC SSL methods of xprt-quic
: connection instance usage is replaced by quic_conn.
Define a special accept cb for QUIC listeners to quic_session_accept().
This operation is conducted during the proto.add callback when creating
listeners.
A special care is now taken care when setting the standard callback
session_accept_fd() to not overwrite if already defined by the proto
layer.
Some functions of xprt-quic were still using connection instead of
quic_conn. This must be removed as the two are decorrelated : a
quic_conn can exist without a connection.
When enabled, objects picked from the cache are checked for corruption
by comparing their contents against a pattern that was placed when they
were inserted into the cache. Objects are also allocated in the reverse
order, from the oldest one to the most recent, so as to maximize the
ability to detect such a corruption. The goal is to detect writes after
free (or possibly hardware memory corruptions). Contrary to DEBUG_UAF
this cannot detect reads after free, but may possibly detect later
corruptions and will not consume extra memory. The CPU usage will
increase a bit due to the cost of filling/checking the area and for the
preference for cold cache instead of hot cache, though not as much as
with DEBUG_UAF. This option is meant to be usable in production.
It is possible that the listener is in INITIAL state, but have to probe
with Handshake packets. In this case, when entering qc_prep_pkts() there
is nothing to do. We must select the next packet number space (or encryption
level) to be able to probe with such packet type.
Remove the unsafe call to tasklet_free in quic_close. At this stage the
tasklet may already be scheduled by an other threads even after if the
quic_conn refcount is now null. It will probably cause a crash on the
next tasklet processing.
Use tasklet_kill instead to ensure that the tasklet is freed in a
thread-safe way. Note that quic_conn_io_cb is not protected by the
refcount so only the quic_conn pinned thread must kill the tasklet.
Adjust slightly refcount code decrement on quic_conn close. A new
function named quic_conn_release is implemented. This function is
responsible to remove the quic_conn from CIDs trees and decrement the
refcount to free the quic_conn once all threads have finished to work
with it.
For now, quic_close is responsible to call it so the quic_conn is
scheduled to be free by upper layers. In the future, it may be useful to
delay it to be able to send remaining data or waiting for missing ACKs
for example.
This simplify quic_conn_drop which do not require the lock anymore.
Also, this can help to free the connection more quickly in some cases.
quic_conn_drop decrement the refcount and may free the quic_conn if
reaching 0. The quic_conn should not be dereferenced again after it in
any case even for traces.
We have an anti-looping protection in process_stream() that detects bugs
that used to affect a few filters like compression in the past which
sometimes forgot to handle a read0 or a particular error, leaving a
thread looping at 100% CPU forever. When such a condition is detected,
an alert it emitted and the process is killed so that it can be replaced
by a sane one:
[ALERT] (19061) : A bogus STREAM [0x274abe0] is spinning at 2057156
calls per second and refuses to die, aborting now! Please
report this error to developers [strm=0x274abe0,3 src=unix
fe=MASTER be=MASTER dst=<MCLI> txn=(nil),0 txn.req=-,0
txn.rsp=-,0 rqf=c02000 rqa=10000 rpf=88000021 rpa=8000000
sif=EST,40008 sib=DIS,84018 af=(nil),0 csf=0x274ab90,8600
ab=0x272fd40,1 csb=(nil),0
cof=0x25d5d80,1300:PASS(0x274aaf0)/RAW((nil))/unix_stream(9)
cob=(nil),0:NONE((nil))/NONE((nil))/NONE(0) filters={}]
call trace(11):
| 0x4dbaab [c7 04 25 01 00 00 00 00]: stream_dump_and_crash+0x17b/0x1b4
| 0x4df31f [e9 bd c8 ff ff 49 83 7c]: process_stream+0x382f/0x53a3
(...)
One problem with this detection is that it used to only count the call
rate because we weren't sure how to make it more accurate, but the
threshold was high enough to prevent accidental false positives.
There is actually one case that manages to trigger it, which is when
sending huge amounts of requests pipelined on the master CLI. Some
short requests such as "show version" are sufficient to be handled
extremely fast and to cause a wake up of an analyser to parse the
next request, then an applet to handle it, back and forth. But this
condition is not an error, since some data are being forwarded by
the stream, and it's easy to detect it.
This patch modifies the detection so that update_freq_ctr() only
applies to calls made without CF_READ_PARTIAL nor CF_WRITE_PARTIAL
set on any of the channels, which really indicates that nothing is
happening at all.
This is greatly sufficient and extremely effective, as the call above
is still caught (shutr being ignored by an analyser) while a loop on
the master CLI now has no effect. The "call_rate" field in the detailed
"show sess" output will now be much lower, except for bogus streams,
which may help spot them. This field is only there for developers
anyway so it's pretty fine to slightly adjust its meaning.
This patch could be backported to stable versions in case of reports
of such an issue, but as that's unlikely, it's not really needed.
Pipelined commands easily result in request buffers to wrap, and the
master-cli parser only deals with linear buffers since it needs contiguous
keywords to look for in a list. As soon as a buffer wraps, some commands
are ignored and the parser is called in loops because the wrapped data
do not leave the buffer.
Let's take the easiest path that's already used at the HTTP layer, we
simply realign the buffer if its input wraps. This rarely happens anyway
(typically once per buffer), remains reasonably cheap and guarantees this
cannot happen anymore.
This needs to be backported as far as 2.0.
When pcli_parse_request() is called with an empty buffer, it still tries
to parse it and can go on believing it finds an empty request if the last
char before the beginning of the buffer is a '\n'. In this case it overwrites
it with a zero and processes it as an empty command, doing nothing but not
making the buffer progress. This results in an infinite loop that is stopped
by the watchdog. For a reason related to another issue (yet to be fixed),
this can easily be reproduced by pipelining lots of commands such as
"show version".
Let's add a length check after the search for a '\n'.
This needs to be backported as far as 2.0.
When a shutdown is detected on the cli, we try to execute all pending
commands first before closing the connection. It is required because
commands execution is serialized. However, when the last part is a partial
command, the cli connection is not closed, waiting for more data. Because
there is no timeout for now on the cli socket, the connection remains
infinitely in this state. And because the maxconn is set to 10, if it
happens several times, the cli socket quickly becomes unresponsive because
all its slots are waiting for more data on a closed connections.
This patch should fix the issue #1512. It must be backported as far as 2.0.
Again, we fix a reminiscence of the way we probed before probing by packet.
When we were probing by datagram we inspected <prv_pkt> to know if we were
coalescing several packets. There is no need to do that at all when probing by packet.
Furthermore this could lead to blocking situations where we want to probe but
are limited by the congestion control (<cwnd> path variable). This must not be
the case. When probing we must do it regardless of the congestion control.
If a client resend Initial CRYPTO data, this is because it did not receive all
the server Initial CRYPTO data. With this patch we prepare a fast retransmission
without waiting for the PTO timer expiration sending old Initial CRYPTO data,
coalescing them with Handshake CRYPTO if present in the same datagram. Furthermore
we send also a datagram made of previously sent Hanshashke CRYPTO data if any.
When probing, we must not take into an account the congestion control window.
This was not completely correctly implemented: qc_build_frms() could fail
because of this limit when comparing the head of the packet againts the
congestion control window. With this patch we make it fail only when
we are not probing.
This is to avoid too much PTO timer expirations for 01RTT and Handshake packet
number spaces. Furthermore we are not limited by the anti-amplication for 01RTT
packet number space. According to the RFC we can send up to two packets.
This modification should have come with this commit:
"MINOR: quic: Remove nb_pto_dgrams quic_conn struct member"
where the nb_pto_dgrams quic_conn struct member was removed.
When building packets to send, we build frames computing their sizes
to have more chance to be added to new packets. There are rare cases
where this packet coult not be built because of the congestion control
which may for instance prevent us from building a packet with padding
(retransmitted Initial packets). In such a case, the pre-built frames
were lost because added to the packet frame list but not move packet
to the packet number space they come from.
With this patch we add the frames to the packet only if it could be built
and move them back to the packet number space if not.
There is no need to use an MT_LIST to store frames to send from a packet
number space. This is a reminiscence for multi-threading support for the TX part.
As reported by @jinsubsim in github issue #1498, there is an
interoperability issue between nghttp2 as a client and a few servers
among which haproxy (in fact likely all those which do not make use
of the dynamic headers table in responses or which do not intend to
use a larger table), when reducing the header table size below 4096.
These are easily testable this way:
nghttp -v -H":method: HEAD" --header-table-size=0 https://$SITE
It will result in a compression error for those which do not start
with an HPACK dynamic table size update opcode.
There is a possible interpretation of the H2 and HPACK specs that
says that an HPACK encoder must send an HPACK headers table update
confirming the new size it will be using after having acknowledged
it, because since it's possible for a decoder to advertise a late
SETTINGS and change it after transfers have begun, the initially
advertised value might very well be seen as a first change from the
initial setting, and the HPACK spec doesn't specify the side which
causes the change that triggers a DTSU update, which was essentially
summed up in this question from nghttp2's author when this issue
was already raised 6 years ago, but which didn't really find a solid
response by then:
https://lists.w3.org/Archives/Public/ietf-http-wg/2015OctDec/0107.html
The ongoing consensus based on what some servers are doing and that aims
at limiting interoperability issues seems to be that a DTSU is expected
for each reduction from the current size, which should be reflected in
the next revision of the H2 spec:
https://github.com/httpwg/http2-spec/pull/1005
Given that we do not make use of this table we can emit a DTSU of zero
before encoding any HPACK frame. However, some clients do not support
receiving DTSU with such values (e.g. VTest) so we cannot do it
inconditionnally!
The current patch aims at sticking as close to the spec as possible by
proceeding this way:
- when a SETTINGS_HEADER_TABLE_SIZE is received, a flag is set
indicating that the value changed
- before sending any HPACK frame, this flag is checked to see if
an update is wanted and if none was sent
- in this case a DTSU of size zero is emitted and a flag is set
to mention it was emitted so that it never has to be sent again
This addresses the problem with nghttp2 without affecting VTest.
More context is available here:
https://github.com/nghttp2/nghttp2/issues/1660https://lists.w3.org/Archives/Public/ietf-http-wg/2021OctDec/0235.html
Many thanks to @jinsubsim for this report and participating to the issue
that led to an improvement of the H2 spec.
This should be backported to stable releases in a timely manner, ideally
as far as 2.4 once the h2spec update is merged, then to other versions
after a few months of observation or in case an issue around this is
reported.
Sending pipelined commands on the CLI using a semi-colon as a delimiter
has a cost that grows linearly with the buffer size, because co_getline()
is called for each word and looks up a '\n' in the whole buffer while
copying its contents into a temporary buffer.
This causes huge parsing delays, for example 3s for 100k "show version"
versus 110ms if parsed only once for a default 16k buffer.
This patch makes use of the new co_getdelim() function to support both
an LF and a semi-colon as delimiters so that it's no more needed to parse
the whole buffer, and that commands are instantly retrieved. We still
need to rely on co_getline() in payload mode as escapes and semi-colons
are not used there.
It should likely be backported where CLI processing speed matters, but
will require to also backport previous patch "MINOR: channel: add new
function co_getdelim() to support multiple delimiters". It's worth noting
that backporting it without "MEDIUM: cli: yield between each pipelined
command" would significantly increase the ratio of disconnections caused
by empty request buffers, for the sole reason that the currently slow
parsing grants more time to request data to come in. As such it would
be better to backport the patch above before taking this one.
For now we have co_getline() which reads a buffer and stops on LF, and
co_getword() which reads a buffer and stops on one arbitrary delimiter.
But sometimes we'd need to stop on a set of delimiters (CR and LF, etc).
This patch adds a new function co_getdelim() which takes a set of delimiters
as a string, and constructs a small map (32 bytes) that's looked up during
parsing to stop after the first delimiter found within the set. It also
supports an optional escape character that skips a delimiter (typically a
backslash). For the rest it works exactly like the two other variants.
Pipelining commands on the CLI is sometimes needed for batched operations
such as map deletion etc, but it causes two problems:
- some possibly long-running commands will be run in series without
yielding, possibly causing extremely long latencies that will affect
quality of service and even trigger the watchdog, as seen in github
issue #1515.
- short commands that end on a buffer size boundary, when not run in
interactive mode, will often cause the socket to be closed when
the last command is parsed, because the buffer is empty.
This patch proposes a small change to this: by yielding in the CLI applet
after processing a command when there are data left, we significantly
reduce the latency, since only one command is executed per call, and
we leave an opportunity for the I/O layers to refill the request buffer
with more commands, hence to execute all of them much more often.
With this change there's no more watchdog triggered on long series of
"del map" on large map files, and the operations are much less disturbed.
It would be desirable to backport this patch to stable versions after some
period of observation in recent versions.
While giving a fresh try to `set server ssl` (which I wrote), I realised
the behavior is a bit inconsistent. Indeed when using this command over
a server with ssl enabled for the data path but also for the health
check path we have:
- data and health check done using tls
- emit `set server be_foo/srv0 ssl off`
- data path and health check path becomes plain text
- emit `set server be_foo/srv0 ssl on`
- data path becomes tls and health check path remains plain text
while I thought the end result would be:
- data path and health check path comes back in tls
In the current code we indeed erase all connections while deactivating,
but restore only the data path while activating. I made this mistake in
the past because I was testing with a case where the health check plain
text by default.
There are several ways to solve this issue. The cleanest one would
probably be to avoid changing the health check connection when we use
`set server ssl` command, and create a new command `set server
ssl-check` to change this. For now I assumed this would be ok to simply
avoid changing the health check path and be more consistent.
This patch tries to address that and also update the documentation. It
should not break the existing usage with health check on plain text, as
in this case they should have `no-check-ssl` in defaults. Without this
patch, it makes the command unusable in an env where you have a list of
server to add along the way with initial `server-template`, and all
using tls for data and healthcheck path.
For 2.6 we should probably reconsider and add `set server ssl-check`
command for better granularity of cases.
If this solution is accepted, this patch should be backported up to >=
2.4.
The alternative solution was to restore the previous state, but I
believe this will create even more confusion in the future.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
hlua_httpclient_table_to_hdrs() does a lua_pop(L, 1) at the end of the
function, this is supposed to be done in the caller and it is already be
done in hlua_httpclient_send().
This call has the consequence of poping the next parameter of the
httpclient, ignoring it.
This patch fixes the issue by removing the lua_pop(L, 1).
Must be backported in 2.5.
Forbid the httpclient to send an empty chunked client when there is no
data to send. It does happen when doing a simple GET too.
Must be backported in 2.5.
htx_add_data() is able to partially consume data. However there is a bug
when the HTX buffer is empty. The data length is not properly
adjusted. Thus, if it exceeds the HTX buffer size, no block is added. To fix
the issue, the length is now adjusted first.
This patch must be backported as far as 2.0.
If we wakeup the I/O handler before the mux is started, it is possible
it has enough time to parse the ClientHello TLS message and update the
mux transport parameters, leading to a crash.
So, we initialize ->qcc quic_conn struct member at the very last time,
when the mux if fully initialized. The condition to wakeup the I/O handler
from lstnr_rcv_pkt() is: xprt context and mux both initialized.
Note that if the xprt context is initialized, it implies its tasklet is
initialized. So, we do not check anymore this latter condition.
The stopping-list management introduced by commit d3a88c1c3 ("MEDIUM:
connection: close front idling connection on soft-stop") missed two
error paths in the H1 and H2 muxes. The effect is that if a stream
or HPACK table couldn't be allocated for these incoming connections,
we would leave with the connection freed still attached to the
stopping_list and it would never leave it, resulting in use-after-free
hence either a crash or a data corruption.
This is marked as medium as it only happens under extreme memory pressure
or when playing with tune.fail-alloc. Other stability issues remain in
such a case so that abnormal behaviors cannot be explained by this bug
alone.
This must be backported to 2.4.
Free the ssl_sock_ctx tasklet in quic_close() instead of
quic_conn_drop(). This ensures that the tasklet is destroyed safely by
the same thread.
This has no impact as the free operation was previously conducted with
care and should not be responsible of any crash.
Implement the emission of Retry packets. These packets are emitted in
response to Initial from clients without token. The token from the Retry
packet contains the ODCID from the Initial packet.
By default, Retry packet emission is disabled and the handshake can
continue without address validation. To enable Retry, a new bind option
has been defined named "quic-force-retry". If set, the handshake must be
conducted only after receiving a token in the Initial packet.
Implement the parsing of token from Initial packets. It is expected that
the token contains a CID which is the DCID from the Initial packet
received from the client without token which triggers a Retry packet.
This CID is then used for transport parameters.
Note that at the moment Retry packet emission is not implemented. This
will be achieved in a following commit.
Implement a new QUIC TLS related function
quic_tls_generate_retry_integrity_tag(). This function can be used to
calculate the AEAD tag of a Retry packet.
It is expected that quic_dgram_read() returns the total number of bytes
read. Fix the return value when the read has been successful. This bug
has no impact as in the end the return value is not checked by the
caller.
->conn quic_conn struct member is a connection struct object which may be
released from several places. With this patch we do our best to stop dereferencing
this member as much as we can.
This commit was not correct:
"MINOR: quic: Only one CRYPTO frame by encryption level"
Indeed, when receiving CRYPTO data from TLS stack for a packet number space,
there are rare cases where there is already other frames than CRYPTO data frames
in the packet number space, especially for 01RTT packet number space. This is
very often with quant as client.
There may be remaining locations where ->conn quic_conn struct member
is used. So let's reset this.
Add a trace to have an idead when this connection is released.
The build on macos was broken by recent commit df91cbd58 ("MINOR: cpuset:
switch to sched_setaffinity for FreeBSD 14 and above."), let's move the
variable declaration inside the ifdef.
A regression was introduced by commit 140f1a58 ("BUG/MEDIUM: mux-h1: Fix
splicing by properly detecting end of message"). To detect end of the
outgoing message, when the content-length is announced, we count amount of
data already sent. But only data really sent must be counted.
If the output buffer is full, we can fail to send data (fully or
partially). In this case, we must take care to only count sent
data. Otherwise we may think too much data were sent and an internal error
may be erroneously reported.
This patch should fix issues #1510 and #1511. It must be backported as far
as 2.4.
If an error is raised during the ClientHello callback on the server side
(ssl_sock_switchctx_cbk), the servername callback won't be called and
the client's SNI will not be saved in the SSL context. But since we use
the SSL_get_servername function to return this SNI in the ssl_fc_sni
sample fetch, that means that in case of error, such as an SNI mismatch
with a frontend having the strict-sni option enabled, the sample fetch
would not work (making strict-sni related errors hard to debug).
This patch fixes that by storing the SNI as an ex_data in the SSL
context in case the ClientHello callback returns an error. This way the
sample fetch can fallback to getting the SNI this way. It will still
first call the SSL_get_servername function first since it is the proper
way of getting a client's SNI when the handshake succeeded.
In order to avoid memory allocations are runtime into this highly used
runtime function, a new memory pool was created to store those client
SNIs. Its entry size is set to 256 bytes since SNIs can't be longer than
255 characters.
This fixes GitHub #1484.
It can be backported in 2.5.
Since version 2.5 the master is automatically re-executed in wait-mode
when the config is successfully loaded, puting corner cases of the wait
mode in plain sight.
When using the -x argument and with the right timing, the master will
try to get the FDs again in wait mode even through it's not needed
anymore, which will harm the worker by removing its listeners.
However, if it fails, (and it's suppose to, sometimes), the
master will exit with EXIT_FAILURE because it does not have the
MODE_MWORKER flag, but only the MODE_MWORKER_WAIT flag. With the
consequence of killing the workers.
This patch fixes the issue by restricting the use of _getsocks to some
modes.
This patch must be backported in every version supported, even through
the impact should me more harmless in version prior to 2.5.
In fact we must look for the first packet with some ack-elicting frame to
in the packet number space tree to retransmit from. Obviously there
may be already retransmit packets which are not deemed as lost and
still present in the packet number space tree for TX packets.
When receiving CRYPTO data from the TLS stack, concatenate the CRYPTO data
to the first allocated CRYPTO frame if present. This reduces by one the number
of handshake packets built for a connection with a standard size certificate.
Avoid closing idle connections if a soft stop is in progress.
By default, idle connections will be closed during a soft stop. In some
environments, a client talking to the proxy may have prepared some idle
connections in order to send requests later. If there is no proper retry
on write errors, this can result in errors while haproxy is reloading.
Even though a proper implementation should retry on connection/write
errors, this option was introduced to support back compat with haproxy <
v2.4. Indeed before v2.4, we were waiting for a last request to be able
to add a "connection: close" header and advice the client to close the
connection.
In a real life example, this behavior was seen in AWS using the ALB in
front of a haproxy. The end result was ALB sending 502 during haproxy
reloads.
This patch was tested on haproxy v2.4, with a regular reload on the
process, and a constant trend of requests coming in. Before the patch,
we see regular 502 returned to the client; when activating the option,
the 502 disappear.
This patch should help fixing github issue #1506.
In order to unblock some v2.3 to v2.4 migraton, this patch should be
backported up to v2.4 branch.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
[wt: minor edits to the doc to mention other options to care about]
Signed-off-by: Willy Tarreau <w@1wt.eu>
When block by the anti-amplification limit, this is the responsability of the
client to unblock it sending new datagrams. On the server side, even if not
well parsed, such datagrams must trigger the PTO timer arming.
Switch back to QUIC_HS_ST_SERVER_HANDSHAKE state after a completed handshake
if acks must be send.
Also ensure we build post handshake frames only one time without using prev_st
variable and ensure we discard the Handshake packet number space only one time.
We need to be able to decrypt late Handshake packets after the TLS secret
keys have been discarded. If not the peer send Handshake packet which have
not been acknowledged. But for such packets, we discard the CRYPTO data.
According to RFC 9002 par. 6.2.3. when receving duplicate Initial CRYPTO
data a server may a packet containing non unacknowledged before the PTO
expiry.
These tests were there to initiate PTO probing but they are not correct.
Furthermore they may break the PTO probing process and lead to useless packet
building.
RFC 9002 5.3. Estimating smoothed_rtt and rttvar:
MUST use the lesser of the acknowledgment delay and the peer's max_ack_delay
after the handshake is confirmed.
When a filter is attached on a stream, the FLT_END analyser must not be
removed from the response channel on L7 retry. It is especially important
because CF_FLT_ANALYZE flag is still set. This means the synchronization
between the two sides when the filter ends can be blocked. Depending on the
timing, this can freeze the stream infinitely or lead to a spinning loop.
Note that the synchronization between the two sides at the end of the
analysis was introduced because the stream was reused in HTTP between two
transactions. But, since the HTX was introduced, a new stream is created for
each transaction. So it is probably possible to remove this step for 2.2 and
higher.
This patch must be backported as far as 2.0.
With this patch pool_evict_last_items builds clusters of up to
CONFIG_HAP_POOL_CLUSTER_SIZE entries so that accesses to the shared
pools are reduced by CONFIG_HAP_POOL_CLUSTER_SIZE and the inter-
thread contention is reduced by as much..
Since previous patch we can forcefully evict multiple objects from the
local cache, even when evicting basd on the LRU entries. Let's define
a compile-time configurable setting to batch releasing of objects. For
now we set this value to 8 items per round.
This is marked medium because eviction from the LRU will slightly change
in order to group the last items that are freed within a single cache
instead of accurately scanning only the oldest ones exactly in their
order of appearance. But this is required in order to evolve towards
batched removals.
We currently have two functions to evict cold objects from local caches:
pool_evict_from_local_cache() to evict from a single cache, and
pool_evict_from_local_caches() to evict oldest objects from all caches.
The new function pool_evict_last_items() focuses on scanning oldest
objects from a pool and releasing a predefined number of them, either
to the shared pool or to the system. For now they're evicted one at a
time, but the next step will consist in creating clusters.
In order to support batched allocations and releases, we'll need to
prepare chains of items linked together and that can be atomically
attached and detached at once. For this we implement a "down" pointer
in each pool_item that points to the other items belonging to the same
group. For now it's always NULL though freeing functions already check
them when trying to release everything.
In pool_evict_from_local_cache() we used to check for room left in the
pool for each and every object. Now we compute the value before entering
the loop and keep into a local list what has to be released, and call
the OS-specific functions for the other ones.
It should already save some cycles since it's not needed anymore to
recheck for the pool's filling status. But the main expected benefit
comes from the ability to pre-construct a list of all releasable
objects, that will later help with grouping them.
In order to support batch allocation from/to shared pools, we'll have to
support a specific representation for pool objects. The new pool_item
structure will be used for this. For now it only contains a "next"
pointer that matches exactly the current storage model. The few functions
that deal with the shared pool entries were adapted to use the new type.
There is no functionality difference at this point.
Instead of letting pool_put_to_shared_cache() pass the object to the
underlying OS layer when there's no more room, let's have the caller
check if the pool is full and either call pool_put_to_shared_cache()
or call pool_free_nocache().
Doing this sensibly simplifies the code as this function now only has
to deal with a pool and an item and only for cases where there are
local caches and shared caches. As the code was simplified and the
calls more isolated, the function was moved to pool.c.
Note that it's only called from pool_evict_from_local_cache{,s}() and
that a part of its logic might very well move there when dealing with
batches.
One of the thread scaling challenges nowadays for the pools is the
contention on the shared caches. There's never any situation where we
have a shared cache and no local cache anymore, so we can technically
afford to transfer objects from the shared cache to the local cache
before returning them to the user via the regular path. This adds a
little bit more work per object per miss, but will permit batch
processing later.
This patch simply moves pool_get_from_shared_cache() to pool.c under
the new name pool_refill_local_from_shared(), and this function does
not return anything but it places the allocated object at the head of
the local cache.
The POOL_LINK macro is now only used for debugging, and it still requires
ifdefs around, which needlessly complicates the code. Let's replace it
and the calling code with a new pair of macros: POOL_DEBUG_SET_MARK()
and POOL_DEBUG_CHECK_MARK(), that respectively store and check the pool
pointer in the extra location at the end of the pool. This removes 4
pairs of ifdefs in the middle of the code.
This practice relying on POOL_LINK() dates from the era where there were
no pool caches, but given that the structures are a bit more complex now
and that pool caches do not make use of this feature, it is totally
useless since released elements have already been overwritten, and yet
it complicates the architecture and prevents from making simplifications
and optimizations. Let's just get rid of this feature. The pointer to
the origin pool is preserved though, as it helps detect incorrect frees
and serves as a canary for overflows.
For an unknown reason, despite the comment stating that we were evicting
oldest objects first from the local caches, due to the use of LIST_NEXT,
the newest were evicted, since pool_put_to_cache() uses LIST_INSERT().
Some tests on 16 threads show that evicting oldest objects instead can
improve performance by 0.5-1% especially when using shared pools.
This patch unlinks and frees the ckch instance linked to a server during
the free of this server.
This could have locked certificates in a "Used" state when removing
servers dynamically from the CLI. And could provoke a segfault once we
try to dynamically update the certificate after that.
This must be backported as far as 2.4.
A lot of free are missing in ssl_sock_free_srv_ctx(), this could result
in memory leaking when removing dynamically a server via the CLI.
This must be backported in every branches, by removing the fields that
does not exist in the previous branches.
This bug was introduced by d817dc73 ("MEDIUM: ssl: Load client
certificates in a ckch for backend servers") in which the creation of
the SSL_CTX for a server was moved to the configuration parser when
using a "crt" keyword instead of being done in ssl_sock_prepare_srv_ctx().
The patch 0498fa40 ("BUG/MINOR: ssl: Default-server configuration ignored by
server") made it worse by setting the same SSL_CTX for every servers
using a default-server. Resulting in any SSL option on a server applied
to every server in its backend.
This patch fixes the issue by reintroducing a string which store the
path of certificate inside the server structure, and loading the
certificate in ssl_sock_prepare_srv_ctx() again.
This is a quick fix to backport, a cleaner way can be achieve by always
creating the SSL_CTX in ssl_sock_prepare_srv_ctx() and splitting
properly the ssl_sock_load_srv_cert() function.
This patch fixes issue #1488.
Must be backported as far as 2.4.
This is a second help to dump loaded library names late at boot, once
external code has already been initialized. The purpose is to provide
a format that makes it easy to pass to "tar" to produce an archive
containing the executable and the list of dependencies. For example
if haproxy is started as "haproxy -f foo.cfg", a config check only
will suffice to quit before starting, "-q" will be used to disable
undesired output messages, and -dL will be use to dump libraries.
This will result in such a command to trivially produce a tarball
of loaded libraries:
./haproxy -q -c -dL -f foo.cfg | tar -T - -hzcf archive.tgz
Many times core dumps reported by users who experience trouble are
difficult to exploit due to missing system libraries. Sometimes,
having just a list of loaded libraries and their respective addresses
can already provide some hints about some problems.
This patch makes a step in that direction by adding a new "show libs"
command that will try to enumerate the list of object files that are
loaded in memory, relying on the dynamic linker for this. It may also
be used to detect that some foreign code embarks other undesired libs
(e.g. some external Lua modules).
At the moment it's only supported on glibc when USE_DL is set, but it's
implemented in a way that ought to make it reasonably easy to be extended
to other platforms.
The approach used for skipping conn_cur in commit db2ab8218 ("MEDIUM:
stick-table: never learn the "conn_cur" value from peers") was wrong,
it only works with simple tables but as soon as frequency counters or
arrays are exchanged after conn_cur, the stream is desynchronized and
incorrect values are read. This is because the fields have a variable
length depending on their types and cannot simply be skipped by a
"continue" statement.
Let's change the approach to make sure we continue to completely parse
these local-only fields, and only drop the value at the moment we're
about to store them, since this is exactly the intent.
A simpler approach could consist in having two sets of stktable_data_ptr()
functions, one for retrieval and one for storage, and to make the store
function return a NULL pointer for local types. For now this doesn't
seem worth the trouble.
This fixes github issue #1497. Thanks to @brenc for the reproducer.
This must be backported to 2.5.
A subtle change of target address allocation was introduced with commit
68cf3959b ("MINOR: backend: rewrite alloc of stream target address") in
2.4. Prior to this patch, a target address was allocated by function
assign_server_address() only if none was previously allocated. After
the change, the allocation became unconditional. Most of the time it
makes no difference, except when we pass multiple times through
connect_server() with SF_ADDR_SET cleared.
The most obvious fix would be to avoid allocating that address there
when already set, but the root cause is that since introduction of
dynamically allocated addresses, the SF_ADDR_SET flag lies. It can
be cleared during redispatch or during a queue redistribution without
the address being released.
This patch instead gives back all its correct meaning to SF_ADDR_SET
and guarantees that when not set no address is allocated, by freeing
that address at the few places the flag is cleared. The flag could
even be removed so that only the address is checked but that would
require to touch many areas for no benefit.
The easiest way to test it is to send requests to a proxy with l7
retries enabled, which forwards to a server returning 500:
defaults
mode http
timeout client 1s
timeout server 1s
timeout connect 1s
retry-on all-retryable-errors
retries 1
option redispatch
listen proxy
bind *:5000
server app 0.0.0.0:5001
frontend dummy-app
bind :5001
http-request return status 500
Issuing "show pools" on the CLI will show that pool "sockaddr" grows
as requests are redispatched, and remains stable with the fix. Even
"ps" will show that the process' RSS grows by ~160B per request.
This fix will need to be backported to 2.4. Note that before 2.5,
there's no strm->si[1].dst, strm->target_addr must be used instead.
This addresses github issue #1499. Special thanks to Daniil Leontiev
for providing a well-documented reproducer.
Properly initialized the ssl_sock_ctx pointer in qc_conn_init. This is
required to avoid to set an undefined pointer in qc.xprt_ctx if argument
*xprt_ctx is NULL.
Implement a refcount on quic_conn instance. By default, the refcount is
0. Two functions are implemented to manipulate it.
* qc_conn_take() which increments the refcount
* qc_conn_drop() which decrements it. If the refcount is 0 *BEFORE*
the substraction, the instance is freed.
The refcount is incremented on retrieve_qc_conn_from_cid() or when
allocating a new quic_conn in qc_lstnr_pkt_rcv(). It is substracted most
notably by the xprt.close operation and at the end of
qc_lstnr_pkt_rcv(). The increments/decrements should be conducted under
the CID lock to guarantee thread-safety.
The timer task is attached to the connection-pinned thread. Only this
thread can delete it. With the future refcount implementation of
quic_conn, every thread can be responsible to remove the quic_conn via
quic_conn_free(). Thus, the timer task deletion is moved from the
calling function quic_close().
Big refactoring on xprt-quic. A lot of functions were using the
ssl_sock_ctx as argument to only access the related quic_conn. All these
arguments are replaced by a quic_conn parameter.
As a convention, the quic_conn instance is always the first parameter of
these functions.
This commit is part of the rearchitecture of xprt-quic layers and the
separation between xprt and connection instances.
Remove the shortcut to use the INITIAL encryption level when removing
header protection on first connection packet.
This change is useful for the following change which removes
ssl_sock_ctx in argument lists in favor of the quic_conn instance.
Add a pointer in quic_conn to its related ssl_sock_ctx. This change is
required to avoid to use the connection instance to access it.
This commit is part of the rearchitecture of xprt-quic layers and the
separation between xprt and connection instances. It will be notably
useful when the connection allocation will be delayed.
free_quic_conn_cids() was called in quic_build_post_handshake_frames()
if an error occured. However, the only error is an allocation failure of
the CID which does not required to call it.
This change is required for future refcount implementation. The CID lock
will be removed from the free_quic_conn_cids() and to the caller.
When a quic_conn is found in the DCID tree, it can be removed from the
first ODCID tree. However, this operation must absolutely be run under a
write-lock to avoid race condition. To avoid to use the lock too
frequently, node.leaf_p is checked. This value is set to NULL after
ebmb_delete.
Some applications may send some information about the reason why they decided
to close a connection. Add them to CONNECTION_CLOSE frame traces.
Take the opportunity of this patch to shorten some too long variable names
without any impact.
Add traces about important frame types to chunk_tx_frm_appendf()
and call this function for any type of frame when parsing a packet.
Move it to quic_frame.c
Since this case was already met previously with commit 655dec81b
("BUG/MINOR: backend: do not set sni on connection reuse"), let's make
sure that we don't change reused connection settings. This could be
generalized to most settings that are only in effect before the handshake
in fact (like set_alpn and a few other ones).
During 2.4-dev, support for malloc_trim() was implemented to ease
release of memory in a stopping process. This was found to be quite
effective and later backported to 2.3.7.
Then it was found that sometimes malloc_trim() could take a huge time
to complete it if was competing with other threads still allocating and
releasing memory, reason why it was decided in 2.5-dev to move
malloc_trim() under the thread isolation that was already in place in
the shared pool version of pool_gc() (this was commit 26ed1835).
However, other instances of pool_gc() that used to call malloc_trim()
were not updated since they were not using thread isolation. Currently
we have two other such instances, one for when there is absolutely no
pool and one for when there are only thread-local pools.
Christian Ruppert reported in GH issue #1490 that he's sometimes seeing
and old process die upon reload when upgrading from 2.3 to 2.4, and
that this happens inside malloc_trim(). The problem is that since
2.4-dev11 with commit 0bae07592 we detect modern libc that provide a
faster thread-aware allocator and do not maintain shared pools anymore.
As such we're using again the simpler pool_gc() implementations that do
not use thread isolation around the malloc_trim() call.
All this code was cleaned up recently and the call moved to a new
function trim_all_pools(). This patch implements explicit thread isolation
inside that function so that callers do not have to care about this
anymore. The thread isolation is conditional so that this doesn't affect
the one already in place in the larger version of pool_gc(). This way it
will solve the problem for all callers.
This patch must be backported as far as 2.3. It may possibly require
some adaptations. If trim_all_pools() is not present, copy-pasting the
tests in each version of pool_gc() will have the same effect.
Thanks to Christian for his detailed report and his testing.
This is the same treatment for bidi and uni STREAM frames. This is a duplication
code which should me remove building a function for both these types of streams.
The connection instance has been replaced by a quic_conn as first
argument to QUIC traces. It is possible to report the quic_conn instance
in the qc_new_conn(), contrary to the connection which is not
initialized at this stage.
Replace the connection instance for first argument of trace callback by
a quic_conn instance. The QUIC trace module is properly initialized with
the first argument refering to a quic_conn.
Replace every connection instances in TRACE_* macros invocation in
xprt-quic by its related quic_conn. In some case, the connection is
still used to access the quic_conn. It may cause some problem on the
future when the connection will be completly separated from the xprt
layer.
This commit is part of the rearchitecture of xprt-quic layers and the
separation between xprt and connection instances.
Prepare trace support for quic_conn instances as argument. This will be
used by the xprt-quic layer in replacement of the connection.
This commit is part of the rearchitecture of xprt-quic layers and the
separation between xprt and connection instances.
Add const qualifier on arguments of several dump functions used in the
trace callback. This is required to be able to replace the first trace
argument by a quic_conn instance. The first argument is a const pointer
and so the members accessed through it must also be const.
Add a new member in ssl_sock_ctx structure to reference the quic_conn
instance if used in the QUIC stack. This member is initialized during
qc_conn_init().
This is needed to be able to access to the quic_conn without relying on
the connection instance. This commit is part of the rearchitecture of
xprt-quic layers and the separation between xprt and connection
instances.
Move qcc_get_qcs() function from xprt_quic.c to mux_quic.c. This
function is used to retrieve the qcs instance from a qcc with a stream
id. This clearly belongs to the mux-quic layer.
Use the convention of naming quic_conn instance as qc to not confuse it
with a connection instance. The changes occured for qc_parse_pkt_frms(),
qc_build_frms() and qc_do_build_pkt().
The QUIC connection I/O handler qc_conn_io_cb() could be called just after
qc_pkt_insert() have inserted a packet in a its tree, and before qc_pkt_insert()
have incremented the reference counter to this packet. As qc_conn_io_cb()
decrement this counter, the packet could be released before qc_pkt_insert()
might increment the counter, leading to possible crashes when trying to do so.
So, let's make qc_pkt_insert() increment this counter before inserting the packet
it is tree. No need to lock anything for that.
Add a function to process all STREAM frames received and ordered
by their offset (qc_treat_rx_strm_frms()) and modify
qc_handle_bidi_strm_frm() consequently.
There were empty lines in the output of the CLI's "show ssl
ocsp-response" command (after the certificate ID and between two
certificates). This patch removes them since an empty line should mark
the end of the output.
Must be backported in 2.5.
With the DCID refactoring, the locking is more centralized. It is
possible to simplify the code for removal of a quic_conn from the ODCID
tree.
This operation can be conducted as soon as the connection has been
retrieved from the DCID tree, meaning that the peer now uses the final
DCID. Remove the bit to flag a connection for removal and just uses
ebmb_delete() on each sucessful lookup on the DCID tree. If the
quic_conn has already been removed, it is just a noop thanks to
eb_delete() implementation.
A new function named qc_retrieve_conn_from_cid() now contains all the
code to retrieve a connection from a DCID. It handle all type of packets
and centralize the locking on the ODCID/DCID trees.
This simplify the qc_lstnr_pkt_rcv() function.
If an UDP datagram contains multiple QUIC packets, they must all use the
same DCID. The datagram context is used partly for this.
To ensure this, a comparison was made on the dcid_node of DCID tree. As
this is a comparison based on pointer address, it can be faulty when
nodes are removed/readded on the same pointer address.
Replace this comparison by a proper comparison on the DCID data itself.
To this end, the dgram_ctx structure contains now a quic_cid member.
For first Initial packets, the socket source dest address is
concatenated to the DCID. This is used to be able to differentiate
possible collision between several clients which used the same ODCID.
Refactor the code to manage DCID and the concatenation with the address.
Before this, the concatenation was done on the quic_cid struct and its
<len> field incremented. In the code it is difficult to differentiate a
normal DCID with a DCID + address concatenated.
A new field <addrlen> has been added in the quic_cid struct. The <len>
field now only contains the size of the QUIC DCID. the <addrlen> is
first initialized to 0. If the address is concatenated, it will be
updated with the size of the concatenated address. This now means we
have to explicitely used either cid.len or cid.len + cid.addrlen to
access the DCID or the DCID + the address. The code should be clearer
thanks to this.
The field <odcid_len> in quic_rx_packet struct is now useless and has
been removed. However, a new parameter must be added to the
qc_new_conn() function to specify the size of the ODCID addrlen.
On haproxy implementation, generated DCID are on 8 bytes, the minimal
value allowed by the specification. Rename the constant representing
this size to inform that this is haproxy specific.
All operation on the ODCID/DCID trees must be conducted under a
read-write lock. Add a missing read-lock on the lookup operation inside
listener handler.
The packet number space flags were mixed with the connection level flags.
This leaded to ACK to be sent at the connection level without regard to
the underlying packet number space. But we want to be able to acknowleged
packets for a specific packet number space.
This is required if we do not want to make haproxy crash during zerortt
interop runner test which makes a client open multiple streams with
long request paths.
A client sends a 0-RTT data packet after an Initial one in the same datagram.
We must be able to parse such packets just after having parsed the Initial packets.
Export the code responsible which set the ->app_ops structure into
quic_set_app_ops() function. It must be called by the TLS callback which
selects the application (ssl_sock_advertise_alpn_protos) so that
to be able to build application packets after having received 0-RTT data.
The TLS does not provide us with TX secrets after we have provided it
with 0-RTT data. This is logic: the server does not need to send 0-RTT
data. We must skip the section where such secrets are derived if we do not
want to close the connection with a TLS alert.
Enable 0-RTT at the TLS context level:
RFC 9001 4.6.1. Enabling 0-RTT
Accordingly, the max_early_data_size parameter is repurposed to hold a
sentinel value 0xffffffff to indicate that the server is willing to accept
QUIC 0-RTT data.
At the SSL connection level, we must call SSL_set_quic_early_data_enabled().
This field is no more useful. Modify the traces consequently.
Also initialize ->pn_node.key value to -1, which is an illegal value
for QUIC packet number, and display it in traces if different from -1.
If not handled by qc_parse_pkt_frms(), the packet which contains it is dropped.
Add only a trace when parsing this frame at this time.
Also modify others to reduce the traces size and have more information about streams.
This patch adds the possibility to add a set of conditions to a set-var
call, be it a converter or an action (http-request or http-response
action for instance). The conditions must all be true for the given
set-var call for the variable to actually be set. If any of the
conditions is false, the variable is left untouched.
The managed conditions are the following : "ifexists", "ifnotexists",
"ifempty", "ifnotempty", "ifset", "ifnotset", "ifgt", "iflt". It is
possible to combine multiple conditions in a single set-var call since
some of them apply to the variable itself, and some others to the input.
This patch does not change the fact that variables of scope proc are
still created during configuration parsing, regardless of the conditions
that might be added to the set-var calls in which they are mentioned.
For instance, such a line :
http-request set-var(proc.foo,ifexists) int(5)
would not prevent the creation of the variable during init, and when
actually reaching this line during runtime, the proc.foo variable would
already exist. This is specific to the proc scope.
These new conditions mean that a set-var could "fail" for other reasons
than memory allocation failures but without clearing the contents of the
variable.
This patch adds the parsing of the optional condition parameters that
can be passed to the set-var and set-var-fmt actions (http as well as
tcp). Those conditions will not be taken into account yet in the var_set
function so conditions passed as parameters will not have any effect.
Since actions do not benefit from the parameter preparsing that
converters have, parsing conditions needed to be done by hand.
This patch adds the parsing of the optional condition parameters that
can be passed to the set-var converter. Those conditions will not be
taken into account yet in the var_set function so conditions passed as
parameters will not have any effect. This is true for any condition
apart from the "ifexists" one that is also used to replace the
VF_UPDATEONLY flag that was used to prevent proc scope variable creation
from a LUA module.
When calling var_set on a variable of type string (SMP_T_STR, SMP_T_BIN
or SMP_T_METH), the contents of the variable were freed directly. When
adding conditions to set-var calls we might have cases in which the
contents of an existing variable should be kept unchanged so the freeing
of the internal buffers is delayed in the var_set function (so that we
can bypass it later).
The type of a newly created variable was not initialized. This patch
sets it to SMP_T_ANY by default. This will be required when conditions
can be added to a set-var call because we might end up creating a
variable without setting it yet.
The vars_set_by_name_ifexist function was created to avoid creating too
many variables from a LUA module. This was made thanks to the
VF_UPDATEONLY flags which prevented variable creation in the var_set
function. Since commit 3a4bedccc ("MEDIUM: vars: replace the global name
index with a hash") this limitation was restricted to 'proc' scope
variables only.
This patch simply moves the scope test to the vars_set_by_name_ifexist
function instead of the var_set function.
allowing for all platforms supporting cpu affinity to have a chance
to detect the cpu topology from a given valid node (e.g.
DragonflyBSD seems to be NUMA aware from a kernel's perspective
and seems to be willing start to provide userland means to get
proper info).
numa_detect_topology() is always define now if USE_CPU_AFFINITY is
activated. For the moment, only on Linux an actual implementation is
provided. For other platforms, it always return 0.
This change has been made to easily add implementation of NUMA detection
for other platforms. The phrasing of the documentation has also been
edited to removed the mention of Linux-only on numa-cpu-mapping
configuration option.
This patch implements a simple "show version" command which returns
the version of the current process.
It's available from the master and the worker processes, so it is easy
to check if the master and the workers have the same version.
This is a minor patch that really improve compatibility checks
for scripts.
Could be backported in haproxy version as far as 2.0.
The master process encounter a crash when trying to access an old
process which left from the master CLI.
To reproduce the problem, you need a prompt to a previous worker, then
wait for this worker to leave, once it left launch a command from this
prompt. The s->target is then filled with a NULL which is dereferenced
when trying to connect().
This patch fixes the problem by checking if s->target is NULL.
Must be backported as far as 2.0.
The htx variable is only initialized if we have received a HTTP/3
HEADERS frame. Else it must not be dereferenced.
This should fix the compilation on CI with gcc.
src/h3.c: In function ‘h3_decode_qcs’:
src/h3.c:224:14: error: ‘htx’ may be used uninitialized in this function
[-Werror=maybe-uninitialized]
224 | htx->flags |= HTX_FL_EOM
Initialize all flow control members on the qcc instance. Without this,
the value are undefined and it may be possible to have errors about
reached streams limit.
The xprt layer is reponsible to notify the mux of a CONNECTION_CLOSE
reception. In this case the flag QC_CF_CC_RECV is positionned on the
qcc and the mux tasklet is waken up.
One of the notable effect of the QC_CF_CC_RECV is that each qcs will be
released even if they have remaining data in their send buffers.
A qcs is not freed if there is remaining data in its buffer. In this
case, the flag QC_SF_DETACH is positionned.
The qcc io handler is responsible to remove the qcs if the QC_SF_DETACH
is set and their buffers are empty.
When a server is dynamically added via the CLI with a custom id, the key
used to insert it in the backend's tree of used names is not initialized.
The server id must be used but it is only used when no custom id is
provided. Thus, with a custom id, HAProxy crashes.
Now, the server id is always used to init this key, to be able to insert the
server in the corresponding tree.
This patch should fix the issue #1481. It must be backported as far as 2.4.
It is now possible to perform captures on the response when
http-after-response rules are evaluated. It may be handy to capture headers
from responses generated by HAProxy.
This patch is trivial, it may be backported if necessary.
Set the HTX EOM flag on RX the app layer. This is required to notify
about the end of the request for the stream analyzers, else the request
channel never goes to MSG_DONE state.
Remove a wrong comparaison with the same buffer on both sides. In any
cases, the FIN is properly set by qcs_push_frame only when the payload
has been totally emptied.
On h09 app layer, if there is not enought size in the tx buffer, the
transfer is interrupted and the flag QC_SF_BLK_MROOM is positionned.
The transfer is woken up by the mux when new buffer size becomes
available.
This ensure that no data is silently discarded during transfer. Without
this, once the buffer is full the data were removed and thus not send to
the client resulting in a truncating payload.
Remove qc_eval_pkt() which has come with the multithreading support. It
was there to evaluate the length of a TX packet before building. We could
build from several thread TX packets without consuming a packet number for nothing (when
the building failed). But as the TX packet building functions are always
executed by the same thread, the one attached to the connection, this does
not make sense to continue to use such a function. Furthermore it is buggy
since we had to recently pad the TX packet under certain circumstances.
After the handshake has succeeded, we must delete any remaining
Initial or Handshake packets from the RX buffer. This cannot be
done depending on the state the connection (->st quic_conn struct
member value) as the packet are not received/treated in order.
Add a null byte to the end of the RX buffer to notify the consumer there is no
more data to treat.
Modify quic_rx_packet_pool_purge() which is the function which remove the
RX packet from the buffer.
Also rename this function to quic_rx_pkts_del().
As the RX packets may be accessed by the QUIC connection handler (quic_conn_io_cb())
the function responsible of decrementing their reference counters must not
access other information than these reference counters! It was a very bad idea
to try to purge the RX buffer asap when executing this function.
Do not leave in the RX buffer packets with CRYPTO data which were
already received. We do this when parsing CRYPTO frame. If already
received we must not consider such frames as if they were not received
in order! This had as side effect to interrupt the transfer of long streams
(ACK frames not parsed).
Handle the case when the app layer sending buffer is full. A new flag
QC_SF_BLK_MROOM is set in this case and the transfer is interrupted. It
is expected that then the conn-stream layer will subscribe to SEND.
The MROOM flag is reset each time the muxer transfer data from the app
layer to its own buffer. If the app layer has been subscribed on SEND it
is woken up.
On qc_send, data are transferred for each stream from their qcs.buf to
the qcs.xprt_buf. Wake up the xprt to warn about new data available for
transmission.
The streams data are transferred from the qcs.buf to the qcs.xprt_buf
during qc_send. If the xprt_buf is not empty and not all data can be
transferred, subscribe the connection on the xprt for sending.
The mux will be woken up by the xprt when the xprt_buf will be cleared.
This happens on ACK reception.
Implement the subscription in the mux on the qcs instance.
Subscribe is now used by the h3 layer when receiving an incomplete frame
on the H3 control stream. It is also used when attaching the remote
uni-directional streams on the h3 layer.
In the qc_send, the mux wakes up the qcs for each new transfer executed.
This is done via the method qcs_notify_send().
The xprt wakes up the qcs when receiving data on unidirectional streams.
This is done via the method qcs_notify_recv().
Set the QC_SF_FIN_STREAM on the app layers (h3 / hq-interop) when
reaching the HTX EOM. This is used to warn the mux layer to set the FIN
on the QUIC stream.
Implement qc_release. This function is called by the upper layer on
connection close. For the moment, this only happens on client timeout.
This functions is used the free a qcs instance. If all bidirectional
streams are freed, the qcc instance and the connection are purged.
Re-implement the QUIC mux. It will reuse the mechanics from the previous
mux without all untested/unsupported features. This should ease the
maintenance.
Note that a lot of features are broken for the moment. They will be
re-implemented on the following commits to have a clean commit history.
The app layer is initialized after the handshake completion by the XPRT
stack. Call the finalize operation just after that.
Remove the erroneous call to finalize by the mux in the TPs callback as
the app layer is not yet initialized at this stage.
This should fix the missing H3 settings currently not emitted by
haproxy.
Add BUG_ON statement when handling a non implemented frames on the
control stream. This is required because frames must be removed from the
RX buffer or else it will stall the buffer.
At the moment the reason_phrase member of a
quic_connection_close/quic_connection_close_app structure is not
allocated. Comment the memcpy to it to avoid segfault.
Many ARMv8 processors also support Aarch32 and can run armv7 and even
thumb2 code. While armv8 compilers will not emit these instructions,
armv7 compilers that are aware of these processors will do. For
example, using gcc built for an armv7 target and passing it
"-mcpu=cortex-a72" or "-march=armv8-a+crc" will result in the CRC32
instruction to be used.
In this case the current assembly code fails because with the ARM and
Thumb2 instruction sets there is no such "%wX" half-registers. We need
to use "%X" instead as the native 32-bit register when running with a
32-bit instruction set, and use "%wX" when using the 64-bit instruction
set (A64).
This is slz upstream commit fab83248612a1e8ee942963fe916a9cdbf085097
At many places we use construct such as:
if (objt_server(blah))
do_something(objt_server(blah));
At -O2 the compiler manages to simplify the operation and see that the
second one returns the same result as the first one. But at -O1 that's
not always the case, and the compiler is able to emit a second
expression and sees the potential null that results from it, and may
warn about a potential null deref (e.g. with gcc-6.5). There are two
solutions to this:
- either the result of the first test has to be passed to a local
variable
- or the second reference ought to be unchecked using the __objt_*
variant.
This patch fixes all occurrences at once by taking the second approach
(the least intrusive). For constructs like:
objt_server(blah) ? objt_server(blah)->name : "no name"
a macro could be useful. It would for example take the object type
(server), the field name (name) and the default value. But there
are probably not enough occurrences across the whole code for this
to really matter.
This should be backported wherever it applies.
The function leaked one full buffer per invocation. Fix this by simply removing
the call to alloc_trash_chunk(), the static chunk from get_trash_chunk() is
sufficient.
This bug was introduced in 0a72f5ee7c, which is
2.5-dev10. This fix needs to be backported to 2.5+.
When a response is validated, the query domain name is checked to be sure it
is the same than the one requested. When an error is reported, the wrong
goto label was used. Thus, the error was lost. Instead of
RSLV_RESP_WRONG_NAME, RSLV_RESP_INVALID was reported.
This bug was introduced by the commit c1699f8c1 ("MEDIUM: resolvers: No
longer store query items in a list into the response").
This patch should fix the issue #1473. No backport is needed.
If H1 headers are not fully received at once, the parsing is restarted a
last time when all headers are finally received. When this happens, the h1m
flags are sanitized to remove all value set during parsing.
But some flags where erroneously preserved. Among others, H1_MF_TE_CHUNKED
flag was not removed, what could lead to parsing error.
To fix the bug and make things easy, a mask has been added with all flags
that must be preserved. It will be more stable. This mask is used to
sanitize h1m flags.
This patch should fix the issue #1469. It must be backported to 2.5.
For each new log forward section, the proxy was added to the log forward
proxy list but the ref on the previous log forward section's proxy was
scratched using "init_new_proxy" which performs a memset. After configuration
parsing this list contains only the last section's proxy.
The post processing walk through this list to resolve "ring" names.
Since some section's proxies are missing in this list, the resolving
is not done for those ones and the pointer on the ring is kept to null
causing a segfault at runtime trying to write a log message
into the ring.
This patch shift the "init_new_proxy" before adding the ref on the
previous log forward section's proxy on currently parsed one.
This patch shoud fix github issue #1464
This patch should be backported to 2.3
When the response is parsed, query items are stored in a list, attached to
the parsed response (resolve_response).
First, there is one and only one query sent at a time. Thus, there is no
reason to use a list. There is a test to be sure there is only one query
item in the response. Then, the reference on this query item is only used to
validate the domain name is the one requested. So the query list can be
removed. We only expect one query item, no reason to loop on query records.
In addition, the query domain name is now immediately checked against the
resolution domain name. This way, the query item is only manipulated during
the response parsing.
When a new response is parsed, it is unexpected to have an old query item
still attached to the resolution. And indeed, when the response is parsed
and validated, the query item is detached and used for a last check on its
dname. However, this is only true for a valid response. If an error is
detected, the query is not detached. This leads to undefined behavior (most
probably a crash) on the next response because the first element in the
query list is referencing an old response.
This patch must be backported as far as 2.0.
During post-parsing stage, the SSL context of a server is initialized if SSL
is configured on the server or its default-server. It is required to be able
to enable SSL at runtime. However a regression was introduced, because the
last parsed default-server is used. But it is not necessarily the
default-server line used to configure the server. This may lead to
erroneously initialize the SSL context for a server without SSL parameter or
the skip it while it should be done.
The problem is the default-server used to configure a server is not saved
during configuration parsing. So, the information is lost during the
post-parsing. To fix the bug, the SRV_F_DEFSRV_USE_SSL flag is
introduced. It is used to know when a server was initialized with a
default-server using SSL.
For the record, the commit f63704488e ("MEDIUM: cli/ssl: configure ssl on
server at runtime") has introduced the bug.
This patch must be backported as far as 2.4.
Add pointer to counters as a member for h1c structure. This pointer is
initialized on h1_init function. This is useful to quickly access and
manipulate the counters inside every h1 functions.
Info about the request and the response parsers are now displayed in H1
traces for advanced and complete verbosity only. This should help debugging.
This patch may be backported as far as 2.4.
Splicing was disabled fo Messages with an unknown length (no C-L or T-E
header) with no valid reason. So now, it is possible to use the kernel
splicing for such messages.
This patch should be backported as far as 2.4.
Since the 2.4.4, the splicing support in the H1 multiplexer is buggy because
end of the message is not properly detected.
On the 2.4, when the requests is spliced, there is no issue. But when the
response is spliced, the client connection is always closed at the end of the
message. Note the response is still fully sent.
On the 2.5 and higher, when the last requests on a connection is spliced, a
client abort is reported. For other requests there is no issue. In all cases,
the requests are fully sent. When the response is spliced, the server connection
hangs till the server timeout and a server abort is reported. The response is
fully sent with no delay.
The root cause is the EOM block suppression. There is no longer extra block to
be sure to call a last time rcv_buf()/snd_buf() callback functions. At the end,
to fix the issue, we must now detect end of the message in rcv_pipe() and
snd_pipe() callback functions. To do so, we rely on the announced message length
to know when the payload is finished. This works because the chunk-encoded
messages are not spliced.
This patch must be backported as far as 2.4 after an observation period.
Apple libmalloc has its own notion of memory arenas as malloc_zone with
rich API having various callbacks for various allocations strategies but
here we just use the defaults.
In trim_all_pools, we advise to purge each zone as much as possible, called "greedy" mode.
In commit 3a4bedccc6 the variable logic was changed. Instead of
accessing variables by their name during runtime, the variable tables
are now indexed by a hash of the name. But the set-var and unset-var
converters try to access the correct variable by calculating a hash on
the sample instead of the already calculated variable hash.
It should be backported to 2.5.
As soon as the connection ID (the one choosen by the QUIC server) has been used
by the client, we can delete its original destination connection ID from its tree.
This patch modifies ha_quic_set_encryption_secrets() to store the
secrets received by the TLS stack and prepare the information for the
next key update thanks to quic_tls_key_update().
qc_pkt_decrypt() is modified to check if we must used the next or the
previous key phase information to decrypt a short packet.
The information are rotated if the packet could be decrypted with the
next key phase information. Then new secrets, keys and IVs are updated
calling quic_tls_key_update() to prepare the next key phase.
quic_build_packet_short_header() is also modified to handle the key phase
bit from the current key phase information.
This function derives the next RX and TX keys and IVs from secrets
for the next key update key phase. We also implement quic_tls_rotate_keys()
which rotate the key update key phase information to be able to continue
to decrypt old key phase packets. Most of these information are pointers
to unsigned char.
quic_tls_derive_keys() is responsible to derive the AEAD keys, IVs and$
header protection key from a secret provided by the TLS stack. We want
to make the derivation of the header protection key be optional. This
is required for the Key Update process where there is no update for
the header protection key.
When running Key Update process, we must maintain much information
especially when the key phase bit has been toggled by the peer as
it is possible that it is due to late packets. This patch adds
quic_tls_kp new structure to do so. They are used to store
previous and next secrets, keys and IVs associated to the previous
and next RX key phase. We also need the next TX key phase information
to be able to encrypt packets for the next key phase.
haproxy may crash when running this statement in qc_lstnr_pkt_rcv():
conn_ctx = qc->conn->xprt_ctx;
because qc->conn may not be initialized. With this patch we ensure
qc->conn is correctly initialized before accessing its ->xprt_ctx
members. We zero the xrpt_ctx structure (ssl_conn_ctx struct), then
initialize its ->conn member with HA_ATOMIC_STORE. Then, ->conn and
->conn->xptr_ctx members of quic_conn struct can be accessed with HA_ATOMIC_LOAD()
If the ClientHello callback does not manage to find a correct QUIC transport
parameters extension, we immediately close the connection with
missing_extension(109) as TLS alert which is turned into 0x16d QUIC connection
error.
When sending a CONNECTION_CLOSE frame to immediately close the connection,
do not provide CRYPTO data to the TLS stack. Do not built anything else than a
CONNECTION_CLOSE and do not derive any secret when in immediately close state.
Seize the opportunity of this patch to rename ->err quic_conn struct member
to ->error_code.
We set this TLS error when no application protocol could be negotiated
via the TLS callback concerned. It is converted as a QUIC CRYPTO_ERROR
error (0x178).
Commit b1f29bc62 ("MINOR: activity/fd: remove the dead_fd counter") got
rid of FD_UPDT_DEAD, but evports managed to slip through the cracks and
wasn't cleaned up, thus it doesn't build anymore, as reported in github
issue #1467. We just need to remove the related lines since the situation
is already handled by the remaining conditions.
Thanks to Dominik Hassler for reporting the issue and confirming the fix.
This must be backported to 2.5 only.
The proxy used by the master CLI is an internal proxy and no filter are
registered on it. Thus, there is no reason to take care to set or unset
filter analyzers in the master CLI analyzers. AN_REQ_FLT_END was set on the
request channel to prevent the infinite forward and be sure to be able to
process one commande at a time. However, the only work because
CF_FLT_ANALYZE flag was used by error as a channel analyzer instead of a
channel flag. This erroneously set AN_RES_FLT_END on the request channel,
that really prevent the infinite forward, be side effet.
In fact, We must avoid this kind of trick because this only work by chance
and may be source of bugs in future. Instead, we must always keep the CLI
request analyzer and add an early return if the response is not fully
processed. It happens when the CLI response analyzer is set.
This patch must be backported as far as 2.0.
The build broke on Windows and MacOS after commit ed232148a ("MEDIUM:
pool: refactor malloc_trim/glibc and jemalloc api addition detections."),
because the extern+attribute(weak) combination doesn't result in a really
weak symbol and it causes an undefined symbol at link time.
Let's reserve this detection to ELF platforms. The runtime detection using
dladdr() remains used if defined.
No backport needed, this is purely 2.6.
Commit 67e371e ("BUG/MEDIUM: mworker: FD leak of the eventpoll in wait
mode") introduced a regression. Upon a reload it tries to deinit the
poller per thread, but no poll loop was initialized after loading the
configuration.
This patch fixes the issue by moving this part of the code in
mworker_reload(), since this function will be called only when the
poller is fully initialized.
This patch must be backported in 2.5.
Remove the verbosity set to 0 on quic_init_stdout_traces. This will
generate even more verbose traces on stdout with the default verbosity
of 1 when compiling with -DENABLE_QUIC_STDOUT_TRACES.
Implement a function quic_init_stdout_traces called at STG_INIT. If
ENABLE_QUIC_STDOUT_TRACES preprocessor define is set, the QUIC trace
module will be automatically activated to emit traces on stdout on the
developer level.
The main purpose for now is to be able to generate traces on the haproxy
docker image used for QUIC interop testing suite. This should facilitate
test failure analysis.
Support qpack header using a non-huffman encoded name in a litteral
field line with name reference.
This format is notably used by picoquic client and should improve
haproxy interop covering.
Change the way the CIDs are organized to rattach received packets DCID
to QUIC connection. This is necessary to be able to handle multiple DCID
to one connection.
For this, the quic_connection_id structure has been extended. When
allocated, they are inserted in the receiver CID tree instead of the
quic_conn directly. When receiving a packet, the receiver tree is
inspected to retrieve the quic_connection_id. The quic_connection_id
contains now contains a reference to the QUIC connection.
The comment is here to warn about a possible thread concurrence issue
when treating INITIAL packets from the same client. The macro unlikely
is added to further highlight this scarce occurence.
It is valid for a QUIC packet to contain a PADDING frame followed by
one or several other frames.
quic_parse_padding_frame() does not require change as it detect properly
the end of the frame with the first non-null byte.
This allow to use quic-go implementation which uses a PADDING-CRYPTO as
the first handshake packet.
Since 2.5, before re-executing in wait mode, the master can have a
working configuration loaded, with a eventpoll fd. This case was not
handled correctly and a new eventpoll FD is leaking in the master at
each reload, which is inherited by the new worker.
Must be backported in 2.5.
Since the wait mode is automatically executed after charging the
configuration, -sf was shown in argv[] with the previous PID, which is
normal, but also the current one. This is only a visual problem when
listing the processes, because -sf does not do anything in wait mode.
Fix the issue by removing the whole "-sf" part in wait mode, but the
executed command can be seen in the argv[] of the latest worker forked.
Must be backported in 2.5.
HAProxy is documented to support gcc >= 3.4 as per INSTALL file, however
hlua.c makes use of c11 only loop initial declarations leading to build
failure when using gcc-4.9.4:
x86_64-unknown-linux-gnu-gcc -Iinclude -Wchar-subscripts -Wcomment -Wformat -Winit-self -Wmain -Wmissing-braces -Wno-pragmas -Wparentheses -Wreturn-type -Wsequence-point -Wstrict-aliasing -Wswitch -Wtrigraphs -Wuninitialized -Wunknown-pragmas -Wunused-label -Wunused-variable -Wunused-value -Wpointer-sign -Wimplicit -pthread -fdiagnostics-color=auto -D_LARGEFILE_SOURCE -D_FILE_OFFSET_BITS=64 -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -O3 -msse -mfpmath=sse -march=core2 -g -fPIC -g -Wall -Wextra -Wundef -Wdeclaration-after-statement -fwrapv -Wno-unused-label -Wno-sign-compare -Wno-unused-parameter -Wno-clobbered -Wno-missing-field-initializers -Wtype-limits -DUSE_EPOLL -DUSE_NETFILTER -DUSE_PCRE2 -DUSE_PCRE2_JIT -DUSE_POLL -DUSE_THREAD -DUSE_BACKTRACE -DUSE_TPROXY -DUSE_LINUX_TPROXY -DUSE_LINUX_SPLICE -DUSE_LIBCRYPT -DUSE_CRYPT_H -DUSE_GETADDRINFO -DUSE_OPENSSL -DUSE_LUA -DUSE_ACCEPT4 -DUSE_SLZ -DUSE_CPU_AFFINITY -DUSE_TFO -DUSE_NS -DUSE_DL -DUSE_RT -DUSE_PRCTL -DUSE_THREAD_DUMP -DUSE_PCRE2 -DPCRE2_CODE_UNIT_WIDTH=8 -I/usr/local/include -DCONFIG_HAPROXY_VERSION=\"2.5.0\" -DCONFIG_HAPROXY_DATE=\"2021/11/23\" -c -o src/connection.o src/connection.c
src/hlua.c: In function 'hlua_config_prepend_path':
src/hlua.c:11292:2: error: 'for' loop initial declarations are only allowed in C99 or C11 mode
for (size_t i = 0; i < 2; i++) {
^
src/hlua.c:11292:2: note: use option -std=c99, -std=gnu99, -std=c11 or -std=gnu11 to compile your code
This commit moves loop iterator to an explicit declaration.
Must be backported to 2.5 because this issue was introduced in
v2.5-dev10~69 with commit 9e5e586e35 ("BUG/MINOR: lua: Fix lua error
handling in `hlua_config_prepend_path()`")
With the master worker, the seamless reload was still requiring an
external stats socket to the previous process, which is a pain to
configure.
This patch implements a way to use the internal socketpair between the
master and the workers to transfer the sockets during the reload.
This way, the master will always try to transfer the socket, even
without any configuration.
The master will still reload with the -x argument, followed by the
sockpair@ syntax. ( ex -x sockpair@4 ). Which use the FD of internal CLI
to the worker.
Since internal proxies are now in the global proxy list, they are now
reachable from core.proxies, core.backends, core.frontends.
This patch fixes the issue by checking the PR_CAP_INT flag before
exposing them in lua, so the user can't have access to them.
This patch must be backported in 2.5.
This patch allows to replace the host header generated by the
httpclient instead of adding a new one, resulting in the server replying
an error 400.
The host header is now generated from the uri only if it wasn't found in
the list of headers.
Also add a new request in the VTC file to test this.
This patch must be backported in 2.5.
A regression was introduced in the commit da91842b6 ("BUG/MEDIUM: cache/cli:
make "show cache" thread-safe"). When cli_io_handler_show_cache() is called,
only one node is retrieved and is used to fill the output buffer in loop.
Once set, the "node" variable is never renewed. At the end, all nodes are
dumped but each one is duplicated several time into the output buffer.
This patch must be backported everywhere the above commit is. It means only
to 2.5 and 2.4.
__ssl_sock_load_new_ckch_instance() does not free correctly the SNI in
the session cache, it only frees the one in the current tid.
This bug was introduced with e18d4e8 ("BUG/MEDIUM: ssl: backend TLS
resumption with sni and TLSv1.3").
This fix must be backported where the mentionned commit was backported.
(all maintained versions).
SSL counters were added with commit d0447a7c3 ("MINOR: ssl: add counters
for ssl sessions") in 2.4, but their updates were not atomic, so it's
likely that under significant loads they are not correct.
This needs to be backported to 2.4.
Since recent 2.5 commit c8cac04bd ("MEDIUM: listener: deprecate "process"
in favor of "thread" on bind lines"), the "process" bind keyword may
report a warning. However some parts like the "stats socket" parser
will call such bind keywords and do not expect to face warnings, so
this will instantly cause a fatal error to be reported. A concrete
effect is that "stats socket ... process 1" will hard-fail indicating
the keyword is deprecated and will be removed in 2.7.
We must relax this test, but the code isn't designed to report warnings,
it uses a single string and only supports reporting an error code (-1).
This patch makes a special case of the ERR_WARN code and uses ha_warning()
to report it, and keeps the rest of the existing error code for other
non-warning codes. Now "process" on the "stats socket" is properly
reported as a warning.
No backport is needed.
The SHOW_TOT() and SHOW_AVG() macros used in cli_io_handler_show_activity()
produce a warning on gcc 4.7 on MIPS with threads disabled because the
compiler doesn't know that global.nbthread is necessarily non-null, hence
that at least one iteration is performed. Let's just change the loop for
a do {} while () that lets the compiler know it's always initialized. It
also has the tiny benefit of making the code shorter.
The shctx code relies on sensitive conditions that are hard to infer
from the code itself, let's add some BUG_ON() to verify them. They
helped spot the previous bugs.
In shctx_row_reserve_hot() we only leave if we've found the exact
requested size instead of at least as large, as is documented. This
results in extra lookups and free calls in the avail loop while it is
not needed, and participates to seeing a negative data_len early as
spotted in previous bugs.
It doesn't seem to have any other impact however, but it's better to
backport it to stable branches.
In shctx_row_reserve_hot(), a missing break allows the avail loop to
loop for a while after having allocated the required blocks, possibly
leading to the point where it could trigger the watchdog after checking
up to 2 million blocks. In addition, the extra iteration may leave one
block assigned with size zero at the head of the avail list, and mark
it as being an isolated chain of 1 block. It's unclear whether this
could have had other consequences.
There is a non-negligible chance that it addreses bugs #1451 and #1284,
as the pattern observed in the loop looks exactly the same as the one
reported there in the crashes.
It's only marked medium because it is extremely hard to trigger. Here
the conditions were reproduced when starting 4k connections at once
requesting objects of random sizes between 0 and 20k to store them into
a small 1MB cache. However the watchdog will never trigger in such a case
so one needs to instrument the functions.
Thanks to Sohaib Ahmad and @g0uZ for providing useful traces.
This will need to be backported to all stable branches.
The "show cache" command restarts from the previous node to look for a
duplicate key, but does this after having released the lock, so under
high write load, the node has many chances of having been reassigned
and the dereference of the node crashes after a few iterations. Since
the keys are unique anyway, there's no point looking for a dup, so
let's just continue from the next value.
This is only marked as medium as it seems to have been there for a
while, and discovering it that late simply means that nobody uses that
command, thus in practice it has a very limited impact on real users.
This should be backported to all stable versions.
A warning is triggered by gcc9 on this code path, which is the compiler
version used by ubuntu20.04 on the github CI.
This is linked to github issue #1445.
When receiving Initial packets for Version Negotiation, no quic_conn is
instantiated. Thus, on the final trace, the quic_conn dereferencement
must be tested before using it.
This simple patch add the parsing support for theses frames. But nothing is
done at this time about the streams or flow control concerned. This is only to
prevent some QUIC tracker or interop runner tests from failing for a reason
independant of their tested features.
When we have already received ACK frames with the same largest packet
number, this is not an error at all. In this case, we must continue
to parse the ACK current frame.
Add ->err member to quic_conn struct to store the connection errors.
This is the responsability of ->send_alert callback of SSL_QUIC_METHOD
struct to handle the TLS alert and consequently update ->err value.
At this time, when entering qc_build_pkt() we build a CONNECTION_CLOSE
frame close the connection when ->err value is not null.
When adding a range, if no "lower" range was present in the ack range root for
the packet number space concerned, we did not check if the new added range could
overlap the next one. This leaded haproxy to crash when encoding negative integer
when building ACK frames.
This bug was revealed thanks to "multi_packet_client_hello" QUIC tracker
test which makes a client send two first Initial packets out of order.
->qc (QUIC connection) member of packet structure were badly initialized
when received as second Initial packet (from picoquic -Q for instance).
This leaded to corrupt the quic_conn structure with random behaviors
as size effects. This bug came with this commit:
"MINOR: quic: Possible wrong connection identification"
If we want to run quic-tracker against haproxy, we must at least
support the draft version of the TLS extension for the QUIC transport
parameters (0xffa5). quic-tracker QUIC version is draft-29 at this time.
We select this depending on the QUIC version. If draft, we select the
draft TLS extension.
UDP datagrams with Initial packet were padded only for the clients (haproxy
servers). But such packets MUST also be padded for the servers (haproxy
listeners). Furthere, for servers, only UDP datagrams containing ack-eliciting
Initial packet must be padded.
A client may send several Initial packets. This is the case for picoquic
with -Q option. In this case we must identify the connection of incoming
Initial packets thanks to the original destination connection ID.
When allocating destination addresses for QUIC connections we did not set
this flag which denotes these addresses have been set. This had as side
effect to prevent the H3 request results from being returned to the QUIC clients.
Note that this bug was revealed by this commit:
"MEDIUM: backend: Rely on addresses at stream level to init server connection"
Thanks to Christopher for having found the real cause of this issue.
During 2.4-dev, an issue with partial frames was fixed with commit
3d4631fec ("BUG/MEDIUM: mux-h2: fix read0 handling on partial frames").
However this patch is not completely correct. It makes h2_recv() return
0 if the connection was shut for reads, but this not make h2_io_cb()
call h2_process(), so if there are any pending data left in the demux
buffer, they will never be processed, and the I/O callback will be
called in loops forever from the poller.
The correct return value there is 1, as is done at the end of the
function to report a pending read0.
This should definitely fix issue #1328. However even after a lot of
tests I couldn't manage to reproduce it, the conditions to enter that
situation are quite racy.
This must be backported to 2.0 since the fix above was merged into
2.0.21 and 2.2.9.
Since commit c2aae74 ("MEDIUM: ssl: Handle early data with OpenSSL
1.1.1"), the codepath of the clientHello callback changed, letting an
unknown SNI escape with a 'return 1' instead of passing through the
abort label.
An error was still emitted because the frontend continued the handshake
with the initial_ctx, which can't be used to achieve an handshake.
However, it had the ugly side effect of letting the request pass in the
case of a TLS resume. Which could be surprising when combining strict-sni
with the removing of a crt-list entry over the CLI for example. (like
its done in the ssl/new_del_ssl_crlfile.vtc reg-test).
This patch switches the code path of the allow_early and abort label, so
the default code path is the abort one, letting the clientHello returns
the correct SSL_AD_UNRECOGNIZED_NAME in case of errors.
Which means the client will now receive:
OpenSSL error[0x14094458] ssl3_read_bytes: tlsv1 unrecognized name
Instead of:
OpenSSL error[0x14094410] ssl3_read_bytes: sslv3 alert handshake failure
Which was the error emitted before HAProxy 1.8.
This patch must be carrefuly backported as far as 1.8 once we validated
its impact.
When establishing an outboud connection, haproxy checks if the cached
TLS session has the same SNI as the connection we are trying to
resume.
This test was done by calling SSL_get_servername() which in TLSv1.2
returned the SNI. With TLSv1.3 this is not the case anymore and this
function returns NULL, which invalidates any outboud connection we are
trying to resume if it uses the sni keyword on its server line.
This patch fixes the problem by storing the SNI in the "reused_sess"
structure beside the session itself.
The ssl_sock_set_servername() now has a RWLOCK because this session
cache entry could be accessed by the CLI when trying to update a
certificate on the backend.
This fix must be backported in every maintained version, however the
RWLOCK only exists since version 2.4.
Sometimes it is really useful to be able to specify a default value for
an optional environment variable, like the ${name-value} construct in
shell. In fact we're really missing this for a number of settings in
reg tests, starting with timeouts.
This commit simply adds support for the common syntax above. Other
common forms like '+' to replace existing variables, or ':-' and ':+'
to act on empty variables, were not implemented at this stage, as they
are less commonly needed.
The else is not for boringSSL but for the lack of Client Hello callback.
Should have been changed in 1fc44d4 ("BUILD: ssl: guard Client Hello
callbacks with HAVE_SSL_CLIENT_HELLO_CB macro instead of openssl
version").
Could be backported in 2.4.
Previously, the cleanup of the listeners was done in mworker_loop(),
which was called once the configuration file was parsed. HAProxy was
switching in wait mode when the configuration failed to load, so no
listeners where created.
Since the latest change on the mworker mode, HAProxy switch to wait mode
after successfuly loading the configuration, without cleaning its
listeners, because it was done in mworker_loop, resulting in the master
not closing its listeners and keeping them. The master needs its
configuration to know which listeners it need to close, so that must be
done before the exec().
This patch fixes the problem by cleaning the listeners in the
mworker_reexec() function.
No backport needeed.
If the client announced a QUIC version not supported by haproxy, emit a
Version Negotiation Packet, according to RFC9000 6. Version Negotiation.
This is required to be able to use the framework for QUIC interop
testing from https://github.com/marten-seemann/quic-interop-runner. The
simulator checks that the server is available by sending packets to
force the emission of a Version Negotiation Packet.
Implement a new app_ops layer for quic interop. This layer uses HTTP/0.9
on top of QUIC. Implementation is minimal, with the intent to be able to
pass interoperability test suite from
https://github.com/marten-seemann/quic-interop-runner.
It is instantiated if the negotiated ALPN is "hq-interop".
Remove the hardcoded initialization of h3 layer on mux init. Now the
ALPN is looked just after the SSL handshake. The app layer is then
installed if the ALPN negotiation returned a supported protocol.
This required to add a get_alpn on the ssl_quic layer which is just a
call to ssl_sock_get_alpn() from ssl_sock. This is mandatory to be able
to use conn_get_alpn().
This change is required to be able to use multiple app_ops layer on top
of QUIC. The stream-interface will now call the mux snd_buf which is
just a proxy to the app_ops snd_buf function.
The architecture may be simplified in the structure to install the
app_ops on the stream_interface and avoid the detour via the mux layer
on the sending path.
When receiving an unknown h3 frame type, the frame must be discarded
silently and the processing of the remaing frames must continue. This is
according to the HTTP/3 draft34.
This issue was detected when using the quiche client which uses GREASE
frame to test interoperability.
The commit a85c522d4 ("BUG/MINOR: mux-h1: Save shutdown mode if the shutdown
is delayed") revealed several hidden bugs in connection's shutdown
handling. One of them is about delayed silent shudown.
If outgoing data are not fully sent, we delayed the shutdown. However, in
h1_process(), only normal (or clean) shutdown are really detected. If a
silent (or dirty) shutdown is performed, the H1 connection is not
immediately released. Of course, in this situation, the client never
acknowledged the shutdown. Thus, the H1 connection remains open till the
client timeout.
This patch should fix the issues #1448 and #1453. It must be backported as
far as 2.0.
When a log message is emitted, The session's listener is always defined when
the session's owner is an inbound connection while it is undefined for a
health-check. It is not obvious. So, comments have been added to make it
clear.
This patch is related to the issue #1434.
When an ipv6 key is used to filter a CLI command on a stick table
(clear/set/show table ...), the return value of inet_pton() call must be
checked to be sure the key is valid.
This patch should fix the issue #1163. It should be backported to all
supported versions.
When haproxy is built with DEBUG_UAF=1, some particularly slow
allocation functions are used for each pool, and it was not uncommon
to see the watchdog trigger during performance tests. For this reason
the allocation functions were surrounded by a pair of thread_harmless
calls to mention that the function was waiting in slow syscalls. The
problem is that this also releases functions blocked in thread_isolate()
which can then start their work.
In order to protect against the accidental removal of a shared resource
in this situation, in 2.5-dev4 with commit ba3ab7907 ("MEDIUM: servers:
make the server deletion code run under full thread isolation") was added
thread_isolate_full() for functions which want to be totally protected
due to being manipulating some data.
But this is not sufficient, because there are still places where we
can allocate/free (thus sleep) under a lock, such as in long call
chains involving the release of an idle connection. In this case, if
one thread asks for isolation, one thread might hang in
pool_alloc_area_uaf() with a lock held (for example the conns_lock
when coming from conn_backend_get()->h1_takeover()->task_new()), with
another thread blocked on a lock waiting for that one to release it,
both keeping their bit clear in the thread_harmless mask, preventing
the first thread from being released, thus causing a deadlock.
In addition to this, it was already seen that the "show fd" CLI handler
could wake up during a pool_free_area_uaf() with an incompletely
released memory area while deleting a file descriptor, and be fooled
showing bad pointers, or during a pool_alloc() on another thread that
was in the process of registering a freshly allocated connection to a
new file descriptor.
One solution could consist in replacing all thread_isolate() calls by
thread_isolate_full() but then that makes thread_isolate() useless
and only shifts the problem by one slot.
A better approach could possibly consist in having a way to mark that
a thread is entering an extremely slow section. Such sections would
be timed so that this is not abused, and the bit would be used to
make the watchdog more patient. This would be acceptable as this would
only affect debugging.
The approach used here for now consists in removing the harmless bits
around the UAF allocator, thus essentially undoing commit 85b2cae63
("MINOR: pools: make the thread harmless during the mmap/munmap
syscalls").
This is marked as minor because nobody is expected to be running with
DEBUG_UAF outside of development or serious debugging, so this issue
cannot affect regular users. It must be backported to stable branches
that have thread_harmless_now() around the mmap() call.
The value for H2_CF_DEM_SHORT_READ flag is wrong. 2 bits are erroneously
set, 0x200 and 0x80000. It is not an issue because both bits are not used
anywhere else.
The typo was introduced in the commit b5f7b5296 ("BUG/MEDIUM: mux-h2: Handle
remaining read0 cases on partial frames"). Thus this patch must also be
backported as far a 2.0.
httpclient_new() sets the hc->req.uri ist without duplicating its
memory, which is a problem since the string in the ist could be
inaccessible at some point. The API was made to use a ist which was
allocated dynamically, but httpclient_new() didn't do that, which result
in a crash when calling istfree().
This patch fixes the problem by doing an istdup()
Fix issue #1452.
When in wait mode, the mworker-prog postparser is launched, but
unfortunately the child structure doesn't contain all required
information to be able to launch the test.
This test is only required when doing a configuration parsing.
Must be backported as far as 2.0.
Since the wait mode is always used once we successfuly loaded the
configuration, every processes were marked as old workers.
To fix this, the PROC_O_LEAVING flag is set only on the processes which
have a number of reloads greater than the current processes.
The ReloadFailed prompt in the master CLI is shown only when
failedreloads > 0. It was previously using a check on the wait mode, but
we always use the wait mode now.
Implement a reload failure counter which counts the number of failure
since the last success. This counter is available in 'show proc' over
the master CLI.
Clarify the startup and reload messages:
On a successful configuration load, haproxy will emit "Loading success."
after successfuly forked the children.
When it didn't success to load the configuration it will emit "Loading failure!".
When trying to reload the master process, it will emit "Reloading
HAProxy".
Use the waitpid mode after successfully loading the configuration, this
way the memory will be freed in the master, and will preserve the memory.
This will be useful when doing a reload with a configuration which has
large maps or a lot of SSL certificates, avoiding an OOM because too
much memory was allocated in the master.
nbproc was removed, it's time to remove any reference to the relative
PID in the master-worker, since there can be only 1 current haproxy
process.
This patch cleans up the alerts and warnings emitted during the exit of
a process, as well as the "show proc" output.
This reverts commit 597909f4e6
http-after-response rules evaluation was changed to do the same that was
done for http-response, in the code. However, the opposite must be performed
instead. Only the rules of the current section must be stopped. Thus the
above commit is reverted and the http-response rules evaluation will be
fixed instead.
Note that only "allow" action is concerned. It is most probably an uncommon
action for an http-after-request rule.
This patch must be backported as far as 2.2 if the above commit was
backported.
A TCP/HTTP action can stop the rules evaluation. However, it should be
applied on the current section only. For instance, for http-requests rules,
an "allow" on a frontend must stop evaluation of rules defined in this
frontend. But the backend rules, if any, must still be evaluated.
For http-response rulesets, according the configuration manual, the same
must be true. Only "allow" action is concerned. However, since the
beginning, this action stops evaluation of all remaining rules, not only
those of the current section.
This patch may be backported to all supported versions. But it is not so
critical because the bug exists since a while. I doubt it will break any
existing configuration because the current behavior is
counterintuitive.
- add new metric: `haproxy_backend_agg_server_check_status`
it counts the number of servers matching a specific check status
this permits to exclude per server check status as the usage is often
to rely on the total. Indeed in large setup having thousands of
servers per backend the memory impact is not neglible to store the per
server metric.
- realign promex_str_metrics array
quite simple implementation - we could improve it later by adding an
internal state to the prometheus exporter, thus to avoid counting at
every dump.
this patch is an attempt to close github issue #1312. It may bebackported
to 2.4 if requested.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
The httpclient uses channel_add_input() to notify the channel layer that
it must forward some data. This function was used with b_data(&req->buf)
which ask to send the size of a buffer (because of the HTX metadata
which fill the buffer completely).
This is wrong and will have the consequence of trying to send data that
doesn't exist, letting HAProxy looping at 100% CPU.
When using htx channel_add_input() must be used with the size of the htx
payload, and not the size of a buffer.
When sending the request payload it also need to sets the buffer size to
0, which is achieved with a htx_to_buf() when the htx payload is empty.
This patch fixes the receive part of the lua httpclient when no payload
was sent.
The lua task was not awoken once it jumped into
hlua_httpclient_rcv_yield(), which caused the lua client to freeze.
It works with a payload because the payload push is doing the wakeup.
A change in the state machine of the IO handler is also require to
achieve correctly the change from the REQ state to the RES state, it has
to detect if there is the right EOM flag in the request.
When "max-age" or "s-maxage" receive their values in quotes, the pointer
to the integer to be parsed is advanced by one, but the error pointer
check doesn't consider this advanced offset, so it will not match a
parse error such as max-age="a" and will take the value zero instead.
This probably needs to be backported, though it's unsure it has any
effect in the real world.
This function claims to perform an strncat()-like operation but it does
not, it always copies the indicated number of bytes, regardless of the
presence of a NUL character (what is currently done by chunk_memcat()).
Let's remove it and explicitly replace it with chunk_memcat().
Fix potential allocation failure of HTX start-line during H3 request
decoding. In this case, h3_decode_qcs returns -1 as error code.
This addresses in part github issue #1445.
During a troublehooting it came obvious that the SNI always ought to
be logged on httpslog, as it explains errors caused by selection of
the default certificate (or failure to do so in case of strict-sni).
This expectation was also confirmed on the mailing list.
Since the field may be empty it appeared important not to leave an
empty string in the current format, so it was decided to place the
field before a '/' preceding the SSL version and ciphers, so that
in the worst case a missing field leads to a field looking like
"/TLSv1.2/AES...", though usually a missing element still results
in a "-" in logs.
This will change the log format for users who already deployed the
2.5-dev versions (hence the medium level) but no released version
was using this format yet so there's no harm for stable deployments.
The reg-test was updated to check for "-" there since we don't send
SNI in reg-tests.
Link: https://www.mail-archive.com/haproxy@formilux.org/msg41410.html
Cc: William Lallemand <wlallemand@haproxy.org>
Its definition is enclosed inside an ifdef SSL_CTRL_SET_TLSEXT_HOSTNAME
which is defined since OpenSSL 0.9.8. Having it conditioned like this
prevents us from using it by default in a log format, which could cause
an error on an old or exotic library.
Let's just always define it and make the sample fetch fail to return
anything on such libs instead.
Commit 3d2093af9 ("MINOR: connection: Add a connection error code sample
fetch") added these convenient sample-fetch functions but it appears that
due to a misunderstanding the redundant "conn" part was kept in their
name, causing confusion, since "fc" already stands for "front connection".
Let's simply call them "fc_err" and "bc_err" to match all other related
ones before they appear in a final release. The VTC they appeared in were
also updated, and the alpha sort in the keywords table updated.
Cc: William Lallemand <wlallemand@haproxy.org>
This directive is documented as being ignored if set in a defaults
section. But it is only mentionned in a small note in the configuration
manual. Thus, now, a warning is emitted. To do so, the errors handling in
parse_compression_options() function was slightly changed.
In addition, this directive is now documented apart from the other
compression directives. This way, it is clearly visible that it must not be
used in a defaults section.
In alloc_dst_address(), the client destination address must only be
retrieved when we are sure to use it. Most of time, this save a syscall to
getsockname(). It is not a bugfix in itself. But it revealed a bug in the
QUIC part. The CO_FL_ADDR_TO_SET flag is not set when the destination
address is create for anew quic client connection.
->frms_rwlock is an old lock supposed to be used when several threads
could handle the same connection. This is no more the case since this
commit:
"MINOR: quic: Attach the QUIC connection to a thread."
Add a buffer per QUIC connection. At this time the listener which receives
the UDP datagram is responsible of identifying the underlying QUIC connection
and must copy the QUIC packets to its buffer.
->pkt_list member has been added to quic_conn struct to enlist the packets
in the order they have been copied to the connection buffer so that to be
able to consume this buffer when the packets are freed. This list is locked
thanks to a R/W lock to protect it from concurent accesses.
quic_rx_packet struct does not use a static buffer anymore to store the QUIC
packets contents.
At this time we allocate an RX buffer by thread.
Also take the opportunity offered by this patch to rename TX related variable
names to distinguish them from the RX part.
jwt_parse_alg would mistakenly return JWT_ALG_NONE for algorithms "",
"n", "no" and "non" because of a strncmp misuse. It now sees them as
unknown algorithms.
No backport needed.
Cc: Tim Duesterhus <tim@bastelstu.be>
This patch renames all dns extra counters and stats functions, types and
enums using the 'resolv' prefix/suffixes.
The dns extra counter domain id used on cli was replaced by "resolvers"
instead of "dns".
The typed extra counter prefix dumping resolvers domain "D." was
also renamed "N." because it points counters on a Nameserver.
This was done to finish the split between "resolver" and "dns" layers
and to avoid further misunderstanding when haproxy will handle dns
load balancing.
This should not be backported.
This patch add a union and struct into dns_counter struct to split
application specific counters.
The only current existing application is the resolver.c layer but
in futur we could handle different application such as dns load
balancing with others specific counters.
This patch should not be backported.
Before this patch the sent error counter was increased
for each targeted nameserver as soon as we were unable to build
the query message into the trash buffer. But this counter is here
to count sent errors at dns.c transport layer and this error is not
related to a nameserver.
This patch stops to increase those counters and sent a log message
to signal the trash buffer size is not large enough to build the query.
Note: This case should not happen except if trash size buffer was
customized to a very low value.
The function was also re-worked to return -1 in this error case
as it was specified in comment. This function is currently
called at multiple point in resolver.c but return code
is still not yet handled. So to advert the user of the malfunction
the log message was added.
This patch should be backported on all versions including the
layer split between dns.c and resolver.c (v >= 2.4)
The sent messages counter was increased at both resolver.c and dns.c
layers.
This patch let the dns.c layer count the sent messages since this
layer handle a retry if transport layer is not ready (EAGAIN on udp
or tcp session ring buffer full).
This patch should be backported on all versions using a split of those
layers for resolving (v >=2.4)
Implement parsing for the server keyword 'ws'. This is used to configure
the mode of selection for websocket protocol. The configuration
documentation has been updated.
A new regtest has been created to test the proper behavior of the
keyword.
Handle properly websocket streams if the server uses an ALPN with both
h1 and h2. Add a new field h2_ws in the server structure. If set to off,
reuse is automatically disable on backend and ALPN is forced to http1.x
if possible. Nothing is done if on.
Implement a mechanism to be able to use a different http version for
websocket streams. A new server member <ws> represents the algorithm to
select the protocol. This can overrides the server <proto>
configuration. If the connection uses ALPN for proto selection, it is
updated for websocket streams to select the right protocol.
Three mode of selection are implemented :
- auto : use the same protocol between non-ws and ws streams. If ALPN is
use, try to update it to "http/1.1"; this is only done if the server
ALPN contains "http/1.1".
- h1 : use http/1.1
- h2 : use http/2.0; this requires the server to support RFC8441 or an
error will be returned by haproxy.
Add a new parameter force_mux_ops. This will be useful to specify an
alternative to the srv->mux_proto field. If non-NULL, it will be use to
force the mux protocol wether srv->mux_proto is set or not.
This argument will become useful to install a mux for non-standard
streams, most notably websocket streams.
Implement a new function to update the ALPN on an existing connection.
on an existing connection. The ALPN from the ssl context can be checked
to update the ALPN only if it is a subset of the context value.
This method will be useful to change a connection ALPN for websocket,
must notably if the server does not support h2 websocket through the
rfc8441 Extended Connect.
Define a new stream flag SF_WEBSOCKET and a new cs flag CS_FL_WEBSOCKET.
The conn-stream flag is first set by h1/h2 muxes if the request is a
valid websocket upgrade. The flag is then converted to SF_WEBSOCKET on
the stream creation.
This will be useful to properly manage websocket streams in
connect_server().
The RFC8441 was not respected by haproxy in regards with server support
for Extended CONNECT. The Extended CONNECT method was used to convert an
Upgrade header stream even if no SETTINGS_ENABLE_CONNECT_PROTOCOL was
received, which is forbidden by the RFC8441. In this case, the behavior
of the http/2 server is unspecified.
Fix this by flagging the connection on receiption of the RFC8441
settings SETTINGS_ENABLE_CONNECT_PROTOCOL. Extended CONNECT is thus only
be used if the flag is present. In the other case, the stream is
immediatly closed as there is no way to handle it in http/2. It results
in a http/1.1 502 or http/2 RESET_STREAM to the client side.
The protocol-upgrade regtest has been extended to test that haproxy does
not emit Extended CONNECT on servers without RFC8441 support.
It must be backported up to 2.4.
Add a state trace to report that a protocol upgrade is converted using
the rfc8441 Extended connect method. This is useful in regards with the
recent changes to improve http/2 websockets.
It is not useful to start a configuration where an invalid static string is
provided as the JWT algorithm. Better make the administrator aware of the
suspected typo by failing to start.
Session struct is already allocated when "tcp-request connection" rules
are evaluated so session-scoped variables turned out easy to support.
This resolves github issue #1408.
A long-standing issue was reported in issue #1215.
In short, var() was initially internally declared as returning a string
because it was not possible by then to return "any type". As such, users
regularly get trapped thinking that when they're storing an integer there,
then the integer matching method automatically applies. Except that this
is not possible since this is related to the config parser and is decided
at boot time where the variable's type is not known yet.
As such, what is done is that the output being declared as type string,
the string match will automatically apply, and any value will first be
converted to a string. This results in several issues like:
http-request set-var(txn.foo) int(-1)
http-request deny if { var(txn.foo) lt 0 }
not working. This is because the string match on the second line will in
fact compare the string representation of the variable against strings
"lt" and "0", none of which matches.
The doc says that the matching method is mandatory, though that's not
the case in the code due to that default string type being permissive.
There's not even a warning when no explicit match is placed, because
this happens very deep in the expression evaluator and making a special
case just for "var" can reveal very complicated.
The set-var() converter already mandates a matching method, as the
following will be rejected:
... if { int(12),set-var(txn.truc) 12 }
while this one will work:
... if { int(12),set-var(txn.truc) -m int 12 }
As such, this patch this modifies var() to match the doc, returning the
type "any", and mandating the matching method, implying that this bogus
config which does not work:
http-request set-var(txn.foo) int(-1)
http-request deny if { var(txn.foo) lt 0 }
will need to be written like this:
http-request set-var(txn.foo) int(-1)
http-request deny if { var(txn.foo) -m int lt 0 }
This *will* break some configs (and even 3 of our regtests relied on
this), but except those which already match string exclusively, all
other ones are already broken and silently fail (and one of the 3
regtests, the one on FIX, was bogus regarding this).
In order to fix existing configs, one can simply append "-m str"
after a "var()" in an ACL or "if" expression:
http-request deny unless { var(txn.jwt_alg) "ES" }
must become:
http-request deny unless { var(txn.jwt_alg) -m str "ES" }
Most commonly, patterns such as "le", "lt", "ge", "gt", "eq", "ne" in
front of a number indicate that the intent was to match an integer,
and in this case "-m int" would be desired:
tcp-response content reject if ! { var(res.size) gt 3800 }
ought to become:
tcp-response content reject if ! { var(res.size) -m int gt 3800 }
This must not be backported, but if a solution is found to at least
detect this exact condition in the generic expression parser and
emit a warning, this could probably help spot configuration bugs.
Link: https://www.mail-archive.com/haproxy@formilux.org/msg41341.html
Cc: Christopher Faulet <cfaulet@haproxy.com>
Cc: Tim Dsterhus <tim@bastelstu.be>
The kill list introduced in commit f766ec6b5 ("MEDIUM: resolvers: use a kill
list to preserve the list consistency") contains a bug. The deatch_row must
be initialized before calling resolv_process_responses() function. However,
this function is called for the dns code. The death_row is not visible from
the outside. So, it is possible to add a resolution in an uninitialized
death_row, leading to a crash.
But, with the current implementation, it is not possible to handle the
death_row in resolv_process_responses() function because, internally, the
kill list may be freed via a call to resolv_unlink_resolution(). At the end,
we are unable to determine all call chains to guarantee a safe use of the
kill list. It is a shameful observation, but unfortunatly true.
So, to make the fix simple, we track all calls to the public resolvers
api. A counter is incremented when we enter in the resolver code and
decremented when we leave it. This way, we are able to track the recursions
to init and release the kill list only once, at the edge.
Following functions are incrementing/decrementing the recurse counter:
* resolv_trigger_resolution()
* resolv_srvrq_expire_task()
* resolv_link_resolution()
* resolv_unlink_resolution()
* resolv_detach_from_resolution_answer_items()
* resolv_process_responses()
* process_resolvers()
* resolvers_finalize_config()
* resolv_action_do_resolve()
This patch should fix the issue #1404. It must be backported everywhere the
above commit was backported.
First of all, we must be careful here because this part was modified and
each time, this introduced a bug. But, in si_update_rx(), we must not
re-enables receives if the channel buffer cannot receive more
data. Otherwise the multiplexer will be wake up for nothing. Because the
stream is woken up when the multiplexer is waiting for more room to move on,
this may lead to a ping-pong loop between the stream and the mux.
Note that for now, it does not fix any known bug. All reported issues in
this area were fixed in another way.
This patch must be backported with a special care. Technically speaking, it
may be backported as far as 2.0.
A Host header must be present for http_update_host() to success.
htx_add_header(htx, ist("Host"), IST_NULL) was used but this is not a
good idea from a semantic point of view. It also tries to make a memcpy
with a len of 0, which is unrequired.
Use an ist("h") instead as a placeholder value.
This patch fixes bug #1439.
Some luaL_buffinit() call was done before the push of the variable name,
where it seems to work correctly with lua < 5.4.3, it brokes
systematically on this version.
This patch inverts the pushstring and the buffinit.
The http_auth_bearer sample fetch can take a header name as parameter,
in which case it will try to extract a Bearer value out of the given
header name instead of the default "Authorization" one. In this case,
the extraction would not have worked because of a misuse of strncasecmp.
This patch fixes this by replacing the standard string functions by ist
ones.
It also properly manages the multiple spaces that could be found between
the scheme and its value.
No backport needed, that's part of JWT which is only in 2.5.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
As per RFC7235, there can be multiple spaces in the value of an
Authorization header, between the scheme and the actual authentication
parameters.
This can be backported to all stable versions since basic auth has almost
always been there.
When a tarpit action is performed, we must be sure to drain data from the
request channel. Otherwise, the mux on the frontend side may be blocked
because the request channel buffer is full.
This may lead to Two bugs. The first one is a HOL blocking on the H2
multiplexer. A tarpitted stream may block all the others because data are
not drained for the whole tarpit timeout. The second bug is a ping-pong loop
between the multiplexer and the stream. The mux is waiting for more space in
the channel buffer, so it wakes up the stream. And the stream systematically
re-enables receives.
This last part is not pretty clean and it will be addressed with another
fix. But draning request data is a good way to fix both bugs in same time.
This patch must be backported as far as 2.0. The legacy HTTP mode is
probably affected, but I don't know if same bugs may be experienced in this
mode.
When a requester is unlink from a resolution, by reading the code, we can
have this call chain:
_resolv_unlink_resolution(srv->resolv_requester)
resolv_detach_from_resolution_answer_items(resolution, requester)
resolv_srvrq_cleanup_srv(srv)
_resolv_unlink_resolution(srv->resolv_requester)
A loop on the resolution answer items is performed inside
resolv_detach_from_resolution_answer_items(). But by reading the code, it
seems possible to recursively unlink the same requester.
To avoid any loop at this stage, the requester clean up must be performed
before the call to resolv_detach_from_resolution_answer_items(). This way,
the second call to _resolv_unlink_resolution() does nothing and returns
immediately because the requester was already detached from the resolution.
This patch is related to the issue #1404. It must be backported as far as
2.2.
When the H1 connection is released, a connection shutdown is now performed.
If it was already performed when the stream was detached, this action has no
effect. But it is mandatory, when an idle H1C is released. Otherwise the
xprt and the socket shutdown is never perfmed. It is especially important
for SSL client connections, because it is the only way to perform a clean
SSL shutdown.
Without this patch, SSL_shutdown is never called, preventing, among other
things, the SSL session caching.
This patch depends on the commit "BUG/MINOR: mux-h1: Save shutdown mode if
the shutdown is delayed". It should be backported as far as 2.0.
The connection shutdown may be delayed if there are pending outgoing
data. The action is performed once data are fully sent. In this case the
mode (dirty/clean) was lost and a clean shutdown was always performed. Now,
the mode is saved to be sure to perform the connection shutdown using the
right mode. To do so, H1C_F_ST_SILENT_SHUT flag is introduced.
This patch should be backported as far as 2.0.
With this feature the lua implementation of the httpclient is now able
to stream a payload larger than an haproxy buffer.
The hlua_httpclient_send() function is now split into:
hlua_httpclient_send() which initiate the httpclient and parse the lua
parameters
hlua_httpclient_snd_yield() which will send the request and be called
again to stream the request if the body is larger than an haproxy buffer
hlua_httpclient_rcv_yield() which will receive the response and store it
in the lua buffer.
This patch add a way to handle HTTP requests streaming using a
callback.
The end of the data must be specified by using the "end" parameter in
httpclient_req_xfer().
The OpenSSL documentation (https://www.openssl.org/docs/man1.1.0/man3/HMAC.html)
specifies:
> It places the result in md (which must have space for the output of the hash
> function, which is no more than EVP_MAX_MD_SIZE bytes). If md is NULL, the
> digest is placed in a static array. The size of the output is placed in
> md_len, unless it is NULL. Note: passing a NULL value for md to use the
> static array is not thread safe.
`EVP_MAX_MD_SIZE` appears to be defined as `64`, so let's simply use a stack
buffer to avoid the whole memory management.
At a few places we were still using protocol_by_family() instead of
the richer protocol_lookup(). The former is limited as it enforces
SOCK_STREAM and a stream protocol at the control layer. At least with
protocol_lookup() we don't have this limitationn. The values were still
set for now but later we can imagine making them configurable on the
fly.
Instead of using sock_type and ctrl_type to select a protocol, let's
make use of the new protocol type. For now they always match so there
is no change. This is applied to address parsing and to socket retrieval
from older processes.
The protocol selection is currently performed based on the family,
control type and socket type. But this is often not enough, as both
only provide DGRAM or STREAM, leaving few variants. Protocols like
SCTP for example might be indistinguishable from TCP here. Same goes
for TCP extensions like MPTCP.
This commit introduces a new enum proto_type that is placed in each
and every protocol definition, that will usually more or less match
the sock_type, but being an enum, will support additional values.
The test on the sock_domain is a bit useless because the protocols are
registered at boot time, and the test silently fails and returns no
error. Use a BUG_ON() instead to make sure to catch such bugs in the
code if any.
When compiled without SSL support, a variable is reported as not used by
GCC.
src/log.c: In function ‘sess_build_logline’:
src/log.c:2056:36: error: unused variable ‘conn’ [-Werror=unused-variable]
2056 | struct connection *conn;
| ^~~~
This does not need to be backported.
Because source and destination address of the client connection are now
updated at the appropriated level (connection, session or stream), original
info about the client connection are preserved. src/src_port/src_is_local
and dst/dst_port/dst_is_local return current info about the client
connection. It is the info at the highest available level. Most of time, the
stream. Any tcp/http rules may alter this info.
To get original info, "fc_" prefix must be added. For instance
"fc_src". Here, only "tcp-request connection" rules may alter source and
destination address/port.
This patch was reverted because it was inconsitent to change connection
addresses at stream level. Especially in HTTP because all requests was
affected by this change and not only the current one. In HTTP/2, it was
worse. Several streams was able to change the connection addresses at the
same time.
It is no longer an issue, thanks to recent changes. With multi-level client
source and destination addresses, it is possible to limit the change to the
current request. Thus this patch can be reintroduced.
If it possible to set source IP/Port from "tcp-request connection",
"tcp-request session" and "http-request" rules but not from "tcp-request
content" rules. There is no reason for this limitation and it may be a
problem for anyone wanting to call a lua fetch to dynamically set source
IP/Port from a TCP proxy. Indeed, to call a lua fetch, we must have a
stream. And there is no stream when "tcp-request connection/session" rules
are evaluated.
Thanks to this patch, "set-src" and "set-src-port" action are now supported
by "tcp_request content" rules.
This patch is related to the issue #1303.
When client source or destination addresses are changed via a tcp/http
action, we update addresses at the appropriate level. When "tcp-request
connection" rules are evaluated, we update addresses at the connection
level. When "tcp-request session" rules is evaluated, we update those at the
session level. And finally, when "tcp-request content" or "http-request"
rules are evaluated, we update the addresses at the stream level.
The same is performed when source or destination ports are changed.
Of course, for now, not all level are supported. But thanks to this patch,
it will be possible.
Just like for the PROXY protocol, when the NetScaler Client IP insertion
header is received, the retrieved client source and destination addresses
are set at the session level. This leaves those at the connection level
intact.
When PROXY protocol line is received, the retrieved client source and
destination addresses are set at the session level. This leaves those at the
connection level intact.
Client source and destination addresses at stream level are used to initiate
the connections to a server. For now, stream-interface addresses are never
set. So, thanks to the fallback mechanism, no changes are expected with this
patch. But its purpose is to rely on addresses at the appropriate level when
set instead of those at the connection level.
If the stream exists, the frontend stream-interface is used to get the
client source and destination addresses when the proxy line is built. For
now, stream-interface or session addresses are never set. So, thanks to the
fallback mechanism, no changes are expected with this patch. But its purpose
is to rely on addresses at the appropriate level when set instead of those
at the connection level.
In src, src-port, dst and dst-port sample fetches, the client source and
destination addresses are retrieved from the appropriate level. It means
that, if the stream exits, we use the frontend stream-interface to get the
client source and destination addresses. Otherwise, the session is used. For
now, stream-interface or session addresses are never set. So, thanks to the
fallback mechanism, no changes are expected with this patch. But its purpose
is to rely on addresses at the appropriate level when set instead of those
at the connection level.
Client source and destination addresses at stream level are now used to emit
SERVER_NAME/SERVER_PORT and REMOTE_ADDR/REMOTE_PORT parameters. For now,
stream-interface addresses are never set. So, thanks to the fallback
mechanism, no changes are expected with this patch. But its purpose is to
rely on addresses at the stream level, when set, instead of those at the
connection level.
Client source and destination addresses at stream level are now used to
compute base32+src and url32+src hashes. For now, stream-interface addresses
are never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the stream
level, when set, instead of those at the connection level.
Client source and destination addresses at stream level are now used to emit
X-Forwarded-For and X-Original-To headers. For now, stream-interface addresses
are never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the stream level,
when set, instead of those at the connection level.
When an embryonic session is killed, if no log format is defined for this
error, a generic error is emitted. When this happens, we now rely on the
session to get the client source address. For now, session addresses are
never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the session
level when set instead of those at the connection level.
When a log message is emitted, if the stream exits, we use the frontend
stream-interface to retrieve the client source and destination
addresses. Otherwise, the session is used. For now, stream-interface or
session addresses are never set. So, thanks to the fallback mechanism, no
changes are expected with this patch. But its purpose is to rely on
addresses at the appropriate level when set instead of those at the
connection level.
For now, stream-interface or session addresses are never set. So, thanks to
the fallback mechanism, no changes are expected with this patch. But its
purpose is to rely on the client addresses at the stream level, when set,
instead of those at the connection level. The addresses are retrieved from
the frontend stream-interface.
For now, these addresses are never set. But the idea is to be able to set, at
least first, the client source and destination addresses at the stream level
without updating the session or connection ones.
Of course, because these addresses are carried by the strream-interface, it
would be possible to set server source and destination addresses at this level
too.
Functions to fill these addresses have been added: si_get_src() and
si_get_dst(). If not already set, these functions relies on underlying
layers to fill stream-interface addresses. On the frontend side, the session
addresses are used if set, otherwise the client connection ones are used. On
the backend side, the server connection addresses are used.
And just like for sessions and conncetions, si_src() and si_dst() may be used to
get source and destination addresses or the stream-interface. And, if not set,
same mechanism as above is used.
For now, these addresses are never set. But the idea is to be able to set
client source and destination addresses at the session level without
updating the connection ones.
Functions to fill these addresses have been added: sess_get_src() and
sess_get_dst(). If not already set, these functions relies on
conn_get_src() and conn_get_dst() to fill session addresses.
And just like for conncetions, sess_src() and sess_dst() may be used to get
source and destination addresses. However, if not set, the corresponding
address from the underlying client connection is returned. When this
happens, the addresses is filled in the connection object.
hlua_http_msg_get_body must return either a Lua string or nil. For some
HTTPMessage objects, HTX_BLK_EOT blocks are also present in the HTX buffer
along with HTX_BLK_DATA blocks. In such cases, _hlua_http_msg_dup will start
copying data into a luaL_Buffer until it encounters an HTX_BLK_EOT. But then
instead of pushing neither the luaL_Buffer nor `nil` to the Lua stack, the
function will return immediately. The end result will be that the caller of
the HTTPMessage.body() method from a Lua filter will see whatever object was
on top of the stack as return value. It may be either a userdata object if
HTTPMessage.body() was called with only two arguments, or the third argument
itself if called with three arguments. Hence HTTPMessage.body() would return
either nil, or HTTPMessage body as Lua string, or a userdata objects, or
number.
This fix ensure that HTTPMessage.body() will always return either a string
or nil.
Reviewed-by: Christopher Faulet <cfaulet@haproxy.com>
Add a check during the httpclient request generation which yield an lua
error when the generation didn't work. The most common case is the lack
of space in the buffer, it can because of too much headers or a too big
body.
Add support for HEAD/PUT/POST/DELETE method with the lua httpclient.
This patch use the httpclient_req_gen() function with a different meth
parameter to implement this.
Also change the reg-test to support a POST request with a body.
httpclient_req_gen() takes a payload argument which can be use to put a
payload in the request. This payload can only fit a request buffer.
This payload can also be specified by the "body" named parameter within
the lua. httpclient.
It is also used within the CLI httpclient when specified as a CLI
payload with "<<".
Remove the zeroing of an idle connection node on remove from a tree.
This is not needed and should improve slightly the performance of idle
connection usage. Besides, it breaks the memory poisoning feature.
Skip the hash connection calcul when reuse must not be used in
connect_server() : this is the case for TCP proxies. This should result
in slightly better performance when using this use-case.
In connect_server(), if http-reuse always is set, the backend connection
is inserted into the available tree as soon as created. However, the
hash connection field is only set later at the end of the function.
This seems to have no impact as the hash connection field is always
position before a lookup. However, this is not a proper usage of ebmb
API. Fix this by setting the hash connection field before the insertion
into the avail tree.
This must be backported up to 2.4.
Add traces in connect_server() to debug idle connection reuse. These
are attached to stream trace module, as it's already in use in
backend.c with the macro TRACE_SOURCE.
The current model causes an issue when trying to spot memory leaks,
because malloc(0) or realloc(0) do not count as allocations since we only
account for the application-usable size. This is the problem that made
issue #1406 not to appear as a leak.
What we're doing now is to account for one extra pointer (the one that
memory allocators usually place before the returned area), so that a
malloc(0) will properly account for 4 or 8 bytes. We don't need something
exact, we just need something non-zero so that a realloc(X) followed by a
realloc(0) without a free() gives a small non-zero result.
It was verified that the results are stable including in the presence
of lots of malloc/realloc/free as happens when stressing Lua.
It would make sense to backport this to 2.4 as it helps in bug reports.
realloc() calls are painful to analyse because they have two non-zero
columns and trying to spot a leaking one requires a bit of scripting.
Let's simply append the delta at the end of the line when alloc and
free are non-nul.
It would be useful to backport this to 2.4 to help with bug reports.
In issue #1406, Lev Petrushchak reported a nasty memory leak on Alpine
since haproxy 2.4 when using Lua, that memory profiling didn't detect.
After inspecting the code and Lua's code, it appeared that Lua's default
allocator does an explicit free() on size zero, while since 2.4 commit
d36c7fa5e ("MINOR: lua: simplify hlua_alloc() to only rely on realloc()"),
haproxy only calls realloc(ptr,0) that performs a free() on glibc but not
on other systems as it's not required by POSIX...
This patch reinstalls the explicit test for nsize==0 to call free().
Thanks to Lev for the very documented report, and to Tim for the links
to a musl thread on the same subject that confirms the diagnostic.
This must be backported to 2.4.
Some browsers may send Initial packets with sizes greater than 1252 bytes
(QUIC_INITIAL_IPV4_MTU). Let us increase this size limit up to 2048 bytes.
Also use this size for "max_udp_payload_size" transport parameter to limit
the size of the datagrams we want to receive.
In issue 1424 Coverity reports that the loop increment is unreachable,
which is true, the list_for_each_entry() was replaced with a for loop,
but it was already not needed and was instead used as a convenient
construct for a single iteration lookup. Let's get rid of all this
now and replace the loop with an "if" statement.
While in H1 we can usually close quickly, in H2 a client might be sending
window updates or anything while we're sending a GOAWAY and the pending
data in the socket buffers at the moment the close() is performed on the
socket results in the output data being lost and an RST being emitted.
One example where this happens easily is with h2spec, which randomly
reports connection resets when waiting for a GOAWAY while haproxy sends
it, as seen in issue #1422. With h2spec it's not window updates that are
causing this but the fact that h2spec has to upload the payload that
comes with invalid frames to accommodate various implementations, and
does that in two different segments. When haproxy aborts on the invalid
frame header, the payload was not yet received and causes an RST to
be sent.
Here we're dealing with this two ways:
- we perform a shutdown(WR) on the connection to forcefully push pending
data on a front connection after the xprt is shut and closed ;
- we drain pending data
- then we close
This totally solves the issue with h2spec, and the extra cost is very
low, especially if we consider that H2 connections are not set up and
torn down often. This issue was never observed with regular clients,
most likely because this pattern does not happen in regular traffic.
After more testing it could make sense to backport this, at least to
avoid reporting errors on h2spec tests.
Sometimes we'd like to do our best to drain pending data before closing
in order to save the peer from risking to receive an RST on close.
This adds a new connection flag CO_FL_WANT_DRAIN that is used to
trigger a call to conn_ctrl_drain() from conn_ctrl_close(), and the
sock_drain() function ignores fd_recv_ready() if this flag is set,
in order to catch latest data. It's not used for now.
Some checks were added by commit 9a3d3fcb5 ("BUG/MAJOR: mux-h2: Don't try
to send data if we know it is no longer possible") to make sure we don't
loop forever trying to send data that cannot leave. But one of the
conditions there is not correct, the one relying on H2_CS_ERROR2. Indeed,
this state indicates that the error code was serialized into the mux
buffer, and since the test is placed before trying to send the data to
the socket, if the connection states only contains a GOAWAY frame, it
may refrain from sending and may close without sending anything. It's
not dramatic, as GOAWAY reports connection errors in situations where
delivery is not even certain, but it's cleaner to make sure the error
is properly sent, and it avoids upsetting h2spec, as seen in github
issue #1422.
Given that the patch above was backported as far as 1.8, this patch will
also have to be backported that far.
Thanks to Ilya for reporting this one.
This applicationn specific flag was added in 2.4-dev by commit 6fa8bcdc7
("MINOR: task: add an application specific flag to the state: TASK_F_USR1")
to help preserve a the idle connections status across wakeup calls. While
the code to do this was OK for tasklets, it was wrong for tasks, as in an
effort not to lose it when setting the RUNNING flag (that tasklets don't
have), it ended up being inconditionally set. It just happens that for now
no regular tasks use it, only tasklets.
This fix makes sure we always atomically perform (state & flags | running)
there, using a CAS. It also does it for tasklets because it was possible
to lose some such flags if set by another thread, even though this should
not happen with current code. In order to make the code more readable (and
avoid the previous mistake of repeated flags in the bit field), a new
TASK_PERSISTENT aggregate was declared in task.h for this.
In practice the CAS is cheap here because task states are stable or
convergent so the loop will almost never be taken.
This should be backported to 2.4.
The crash that was fixed by commit 7045590d8 ("BUG/MAJOR: dns: attempt
to lock globaly for msg waiter list instead of use barrier") was now
completely analysed and confirmed to be partially a result of the
debugging code added to LIST_INLIST(), which was looking at both
pointers and their reciprocals, and that, if used in a concurrent
context, could perfectly return false if a neighbor was being added or
removed while the current one didn't change, allowing the LIST_APPEND
to fail.
As the LIST API was not designed to be used in a concurrent context,
we should not rely on LIST_INLIST() but on the newly introduced
LIST_INLIST_ATOMIC().
This patch simply reverts the commit above to switch to the new test,
saving a lock during potentially long operations. It was verified that
the check doesn't fail anymore.
It is unsure what the performance impact of the fix above could be in
some contexts. If any performance regression is observed, it could make
sense to backport this patch, along with the previous commit introducing
the LIST_INLIST_ATOMIC() macro.
We're using an XXH32() on the record to insert it into or look it up from
the tree. This way we don't change the rest of the code, the comparisons
are still made on all fields and the next node is visited on mismatch. This
also allows to continue to use roundrobin between identical nodes.
Just doing this is sufficient to see the CPU usage go down from ~60-70% to
4% at ~2k DNS requests per second for farm with 300 servers. A larger
config with 12 backends of 2000 servers each shows ~8-9% CPU for 6-10000
DNS requests per second.
It would probably be possible to go further with multiple levels of indexing
but it's not worth it, and it's important to remember that tree nodes take
space (the struct answer_list went back from 576 to 600 bytes).
With SRV records, a huge amount of time is spent looking for records
by walking long lists. It is possible to reduce this by indexing values
in trees instead. However the whole code relies a lot on the list
ordering, and even implements some round-robin on it to distribute IP
addresses to servers.
This patch starts carefully by replacing the list with a an eb32 tree
that is still used like a list, with a constant key 0. Since ebtrees
preserve insertion order for duplicates, the tree walk visits the nodes
in the exact same order it did with the lists. This allows to implement
the required infrastructure without changing the behavior.
When cleaning up the code to remove most explicit task masks in commit
beeabf531 ("MINOR: task: provide 3 task_new_* wrappers to simplify the
API"), a mistake was done with the external checks where the call does
task_new_on(1) instead of task_new_on(0) due to the confusion with the
previous mask 1.
No backport is needed as that's only 2.5-dev.
This one was used to indicate whether the callee had to follow particularly
safe code path when removing resolutions. Since the code now uses a kill
list, this is not needed anymore.
When scanning resolution.curr it's possible to try to free some
resolutions which will themselves result in freeing other ones. If
one of these other ones is exactly the next one in the list, the list
walk visits deleted nodes and causes memory corruption, double-frees
and so on. The approach taken using the "safe" argument to some
functions seems to work but it's extremely brittle as it is required
to carefully check all call paths from process_ressolvers() and pass
the argument to 1 there to refrain from deleting entries, so the bug
is very likely to come back after some tiny changes to this code.
A variant was tried, checking at various places that the current task
corresponds to process_resolvers() but this is also quite brittle even
though a bit less.
This patch uses another approach which consists in carefully unlinking
elements from the list and deferring their removal by placing it in a
kill list instead of deleting them synchronously. The real benefit here
is that the complexity only has to be placed where the complications
are.
A thread-local list is fed with elements to be deleted before scanning
the resolutions, and it's flushed at the end by picking the first one
until the list is empty. This way we never dereference the next element
and do not care about its presence or not in the list. One function,
resolv_unlink_resolution(), is exported and used outside, so it had to
be modified to use this list as well. Internal code has to use
_resolv_unlink_resolution() instead.
The code as it is uses crossed lists between many elements, and at
many places the code relies on list iterators or emptiness checks,
which does not work with only LIST_DELETE. Further, it is quite
difficult to place debugging code and checks in the current situation,
and gdb is helpless.
This code replaces all LIST_DELETE calls with LIST_DEL_INIT so that
it becomes possible to trust the lists.
This function allocates requesters by hand for each and every type. This
is complex and error-prone, and it doesn't even initialize the list part,
leaving dangling pointers that complicate debugging.
This patch introduces a new function resolv_get_requester() that either
returns the current pointer if valid or tries to allocate a new one and
links it to its destination. Then it makes use of it in the function
above to clean it up quite a bit. This allows to remove complicated but
unneeded tests.
Similar to the previous patch, the answer's list was only initialized the
first time it was added to a list, leading to bogus outdated pointer to
appear when debugging code is added around it to watch it. Let's make
sure it's always initialized upon allocation.
The query_list is physically stored in the struct resolution itself,
so we have a list that contains a list to items stored in itself (and
there is a single item). But the list is first initialized in
resolv_validate_dns_response(), while it's scanned in
resolv_process_responses() later after calling the former. First,
this results in crashes as soon as the code is instrumented a little
bit for debugging, as elements from a previous incarnation can appear.
But in addition to this, the presence of an element is checked by
verifying that the return of LIST_NEXT() is not NULL, while it may
never be NULL even for an empty list, resulting in bugs or crashes
if the number of responses does not match the list's contents. This
is easily triggered by testing for the list non-emptiness outside of
the function.
Let's make sure the list is always correct, i.e. it's initialized to
an empty list when the structure is allocated, elements are checked by
first verifying the list is not empty, they are deleted once checked,
and in any case at end so that there are no dangling pointers.
This should be backported, but only as long as the patch fits without
modifications, as adaptations can be risky there given that bugs tend
to hide each other.
Depending on the code that precedes the loop, gcc may emit this warning:
src/resolvers.c: In function 'resolv_process_responses':
src/resolvers.c:1009:11: warning: potential null pointer dereference [-Wnull-dereference]
1009 | if (query->type != DNS_RTYPE_SRV && flags & DNS_FLAG_TRUNCATED) {
| ~~~~~^~~~~~
However after carefully checking, r_res->header.qdcount it exclusively 1
when reaching this place, which forces the for() loop to enter for at
least one iteration, and <query> to be set. Thus there's no code path
leading to a null deref. It's possibly just because the assignment is
too far and the compiler cannot figure that the condition is always OK.
Let's just mark it to please the compiler.
This code is dangerous enough that we certainly don't want external code
to ever approach it, let's not export unnecessary functions like this one.
It was made static and a comment was added about its purpose.
There is a fundamental design bug in the resolvers code which is that
a list of active resolutions is being walked to try to delete outdated
entries, and that the code responsible for removing them also removes
other elements, including the next one which will be visited by the
list iterator. This randomly causes a use-after-free condition leading
to crashes, infinite loops and various other issues such as random memory
corruption.
A first fix for the memory fix for this was brought by commit 0efc0993e
("BUG/MEDIUM: resolvers: Don't release resolution from a requester
callbacks"). While preparing for more fixes, some code was factored by
commit 11c6c3965 ("MINOR: resolvers: Clean server in a dedicated function
when removing a SRV item"), which inadvertently passed "0" as the "safe"
argument all the time, missing one case of removal protection, instead
of always using "safe". This patch reintroduces the correct argument.
This must be backported with all fixes above.
Cc: Christopher Faulet <cfaulet@haproxy.com>
A few places have been caught triggering late bugs recently, always cases
of use-after-free because a freed element was still found in one of the
lists. This patch adds a few checks for such elements in dns_session_free()
before the final pool_free() and dns_session_io_handler() before adding
elements to lists to make sure they remain consistent. They do not trigger
anymore now.
When dns_session_release() calls dns_session_free(), it was shown that
it might still be attached there:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
768 BUG_ON(!LIST_ISEMPTY(&ds->ring.waiters));
[Current thread is 1 (Thread 0x7f895bbe2700 (LWP 31792))]
(gdb) bt
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
#1 0x0000000000643ab8 in dns_session_release (appctx=0x7f89545a4ff0) at src/dns.c:805
#2 0x000000000062e35a in si_applet_release (si=0x7f89545a5550) at include/haproxy/stream_interface.h:236
#3 0x000000000063150f in stream_int_shutw_applet (si=0x7f89545a5550) at src/stream_interface.c:1697
#4 0x0000000000640ab8 in si_shutw (si=0x7f89545a5550) at include/haproxy/stream_interface.h:437
#5 0x0000000000643103 in dns_session_io_handler (appctx=0x7f89545a4ff0) at src/dns.c:725
#6 0x00000000006d776f in task_run_applet (t=0x7f89545a5100, context=0x7f89545a4ff0, state=81924) at src/applet.c:90
#7 0x000000000068b82b in run_tasks_from_lists (budgets=0x7f895bbbf5c0) at src/task.c:611
#8 0x000000000068c258 in process_runnable_tasks () at src/task.c:850
#9 0x0000000000621e61 in run_poll_loop () at src/haproxy.c:2636
#10 0x0000000000622328 in run_thread_poll_loop (data=0x8d7440 <ha_thread_info+64>) at src/haproxy.c:2807
#11 0x00007f895c54a06b in start_thread () from /lib64/libpthread.so.0
#12 0x00007f895bf3772f in clone () from /lib64/libc.so.6
(gdb) p &ds->ring.waiters
$1 = (struct list *) 0x7f895439e8a8
(gdb) p ds->ring.waiters
$2 = {
n = 0x7f89545a5078,
p = 0x7f89545a5078
}
(gdb) p ds->ring.waiters->n
$3 = (struct list *) 0x7f89545a5078
(gdb) p *ds->ring.waiters->n
$4 = {
n = 0x7f895439e8a8,
p = 0x7f895439e8a8
}
Let's always detach it before freeing so that it remains possible to
check the dns_session's ring before releasing it, and possibly catch
bugs.
The barrier is insufficient here to protect the waiters list as we can
definitely catch situations where ds->waiter shows an inconsistency
whereby the element is not attached when entering the "if" block and
is already attached when attaching it later.
This patch uses a larger lock to maintain consistency. Without it the
code would crash in 30-180 minutes under heavy stress, always showing
the same problem (ds->waiter->n->p != &ds->waiter). Now it seems to
always resist, suggesting that this was indeed the problem.
This will have to be backported to 2.4.
Using tcp, after a session release and free, the session can remain
attached to the list of sessions with a response message waiting for
a commit (ds->waiter). This results to a use after free of this
session.
Also, on some error path and after free, a session could remain attached
to the lists of available idle/free sessions (ds->list).
This patch ensure to remove the session from those external lists
before a free.
This patch should be backported to all version including
the dns over tcp (2.4)
When an HTTP response is parsed, early parsing errors are not properly
handled. When this error is reported by the multiplexer, nothing is copied
into the input buffer. The HTX message remains empty but the
HTX_FL_PARSING_ERROR flag is set. In addition CS_FL_EOI is set on the
conn-stream. This last flag must be handled to prevent subscription for
receive events. Otherwise, in the best case, a L7 timeout error is
reported. But a transient loop is also possible if a shutdown is received
because the multiplexer notifies the check of the event while the check
never handles it and waits for more data.
Now, if CS_FL_EOI flag is set on the conn-stream, expect rules are
evaluated. Any error must be handled there.
Thanks to @kazeburo for his valuable report.
This patch should fix the issue #1420. It must be backported at least to
2.4. On 2.3 and 2.2, there is no loop but the wrong error is reported (empty
response instead of invalid one). Thus it may also be backported as far as
2.2.
If a channel error (READ_ERRO|READ_TIMEOUT|WRITE_ERROR|WRITE_TIMEOUT) is
detected by the stream, in process_stream(), FLT_END analyers must be
preserved. It is important to be sure to ends filter analysis and be able to
release the stream.
First, filters may release some ressources when FLT_END analyzers are
called. Then, the CF_FL_ANALYZE flag is used to sync end of analysis for the
request and the response. If FLT_END analyzer is ignored on a channel, this
may block the other side and freeze the stream.
This patch must be backported to all stable versions
Replace the test based on the enum value of the algorithm by an explicit
switch statement in case someone reorders it for some reason (while
still managing not to break the regtest).
resolv_hostname_cmp() is bogus, it is applied on labels and not plain
names, but doesn't make any distinction between length prefixes and
characters, so it compares the labels lengths via tolower() as well.
The only reason for which it doesn't break is because labels cannot
be larger than 63 bytes, and that none of the common encoding systems
have upper case letters in the lower 63 bytes, that could be turned
into a different value via tolower().
Now that all labels are stored in lower case, we don't need to burn
CPU cycles in tolower() at run time and can use memcmp() instead of
resolv_hostname_cmp(). This results in a ~22% lower CPU usage on large
farms using SRV records:
before:
18.33% haproxy [.] resolv_validate_dns_response
10.58% haproxy [.] process_resolvers
10.28% haproxy [.] resolv_hostname_cmp
7.50% libc-2.30.so [.] tolower
46.69% total
after:
24.73% haproxy [.] resolv_validate_dns_response
7.78% libc-2.30.so [.] __memcmp_avx2_movbe
3.65% haproxy [.] process_resolvers
36.16% total
The whole code relies on performing case-insensitive comparison on
lookups, which is extremely inefficient.
Let's make sure that all labels to be looked up or sent are first
converted to lower case. Doing so is also the opportunity to eliminate
an inefficient memcpy() in resolv_dn_label_to_str() that essentially
runs over a few unaligned bytes at once. As a side note, that call
was dangerous because it relied on a sign-extended size taken from a
string that had to be sanitized first.
This is tagged medium because while this is 100% safe, it may cause
visible changes on the wire at the packet level and trigger bugs in
test programs.
Coverity reported in issue #1416 that label oom3 is not reachable in
function close_listener() added by commit 59a877dfd ("MINOR: listeners:
add clone_listener() to duplicate listeners at boot time"). The code
leading to it was removed during the development of the function, but
not the label itself.
Coverity noticed in issue #1416 that a missing allocation error was
introduced in tcp_bind_listener() with the rework of error messages by
commit ed1748553 ("MINOR: proto_tcp: use chunk_appendf() to ouput socket
setup errors"). In practice nobody will ever face it but better address
it anyway.
No backport is needed.
When the clone_listener() function was added in commit 59a877dfd
("MINOR: listeners: add clone_listener() to duplicate listeners at
boot time"), a stupid bug was introduced when splitting the error
path because while the first case where calloc fails will leave NULL
in the output value, the other cases will return the pointer to a
freed area. This was reported by Coverity in issue #1416.
In practice nobody will face it (out-of-memory while checking config),
but let's fix it.
No backport is needed.
Commit 7a06ffb85 ("BUG/MEDIUM: sample: Cumulate frontend and backend
sample validity flags") introduced a typo confusing the request and
the response direction when checking for validity of a rule applied
to a backend. This was reported by Coverity in issue #1417.
This needs to be backported where the patch above is backported.
Fix the macro used to retrieve the max number of cpus on FreeBSD. The
MAXCPU is not properly defined in userspace and always set to 1 despite
the machine architecture. Replace it with CPU_SETSIZE.
See https://freebsd-hackers.freebsd.narkive.com/gw4BeLum/smp-in-machine-params-h#post6
Without this, the following config file is rejected on FreeBSD even if
the machine is SMP :
global
cpu-map 1-2 0-1
This must be backported up to 2.4.
It is now possible to have TCP/HTTP rules and ACLs defined in defaults
sections. So we must try to release corresponding lists when a default proxy
is destroyed.
No backport needed.
When the sample validity flags are computed to check if a sample is used in
a valid scope, the flags depending on the proxy capabilities must be
cumulated. Historically, for a sample on the request, only the frontend
capability was used to set the sample validity flags while for a sample on
the response only the backend was used. But it is a problem for listen or
defaults proxies. For those proxies, all frontend and backend samples should
be valid. However, at many place, only frontend ones are possible.
For instance, it is impossible to set the backend name (be_name) into a
variable from a listen proxy.
This bug exists on all stable versions. Thus this patch should probably be
backported. But with some caution because the code has probably changed
serveral times. Note that nobody has ever noticed this issue. So the need to
backport this patch must be evaluated for each branch.
As for TCP rules, HTTP rules from defaults section are now evaluated. These
rules are evaluated before those of the proxy. The same default ruleset
cannot be attached to the frontend and the backend. However, at this stage,
we take care to not execute twice the same ruleset. So, in theory, a
frontend and a backend could use the same defaults section. In this case,
the default ruleset is executed before all others and only once.
TCP rules from defaults section are now evaluated. These rules are evaluated
before those of the proxy. For L7 TCP rules, the same default ruleset cannot
be attached to the frontend and the backend. However, at this stage, we take
care to not execute twice the same ruleset. So, in theory, a frontend and a
backend could use the same defaults section. In this case, the default
ruleset is executed before all others and only once.
TCP and HTTP rules can now be defined in defaults sections, but only those
with a name. Because these rules may use conditions based on ACLs, ACLs can
also be defined in defaults sections.
However there are some limitations:
* A defaults section defining TCP/HTTP rules cannot be used by a defaults
section
* A defaults section defining TCP/HTTP rules cannot be used bu a listen
section
* A defaults sections defining TCP/HTTP rules cannot be used by frontends
and backends at the same time
* A defaults sections defining 'tcp-request connection' or 'tcp-request
session' rules cannot be used by backends
* A defaults sections defining 'tcp-response content' rules cannot be used
by frontends
The TCP request/response inspect-delay of a proxy is now inherited from the
defaults section it uses. For now, these rules are only parsed. No evaluation is
performed.
With the commit eaba25dd9 ("BUG/MINOR: tcpcheck: Don't use arg list for
default proxies during parsing"), we restricted the use of sample fetch in
tcpcheck rules defined in a defaults section to those depending on explicit
arguments only. This means a tcpcheck rules defined in a defaults section
cannot rely on argument unresolved during the configuration parsing.
Thanks to recent changes, it is now possible again.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
When the parsing of a defaults section is started, the previous anonymous
defaults section is removed. It may be a problem with referenced defaults
sections. And because all unused defautl proxies are removed after the
configuration parsing, it is not required to remove it so early.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
If a not-ready default proxy is referenced by a proxy during the
configuration validity check, its configuration is also finished and
PR_FL_READY flag is set on it.
For now, the arguments resolution is the only step performed.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
The PR_FL_READY flags must now be set on a proxy at the end of the
configuration validity check to notify it is fully configured and may be
safely used.
For now there is no real usage of this flag. But it will be usefull for
referenced default proxies to finish their configuration only once.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
A proxy may now references the defaults section it is used. To do so, a
pointer on the default proxy was added in the proxy structure. And a
refcount must be used to track proxies using a default proxy. A default
proxy is destroyed iff its refcount is equal to zero and when it drops to
zero.
All this stuff must be performed during init/deinit staged for now. All
unreferenced default proxies are removed after the configuration parsing.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
It is now possible to designate the defaults section to use by adding a name
of the corresponding defaults section and referencing it in the desired
proxy section. However, this introduces an ambiguity. This named defaults
section may still be implicitly used by other proxies if it is the last one
defined. In this case for instance:
default common
...
default frt from common
...
default bck from common
...
frontend fe from frt
...
backend be from bck
...
listen stats
...
Here, it is not really obvious the last section will use the 'bck' defaults
section. And it is probably not the expected behaviour. To help users to
properly configure their haproxy, a warning is now emitted if a defaults
section is explicitly AND implicitly used. The configuration manual was
updated accordingly.
Because this patch adds a warning, it should probably not be backported to
2.4. However, if is is backported, it depends on commit "MINOR: proxy:
Introduce proxy flags to replace disabled bitfield".
It is not yet used but thanks to this patch, it will be possible to resolve
arguments found in defaults sections. However, there is some restrictions:
* For FE (frontend) or BE (backend) arguments, if the proxy is explicity
defined, there is no change. But for implicit proxy (not specified), the
argument points on the default proxy. when a sample fetch using this
kind of argument is evaluated, the default proxy replaced by the current
one.
* For SRV (server) and TAB (stick-table)arguments, the proxy must always
be specified. Otherwise an error is reported.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
This change is required to support TCP/HTTP rules in defaults sections. The
'disabled' bitfield in the proxy structure, used to know if a proxy is
disabled or stopped, is replaced a generic bitfield named 'flags'.
PR_DISABLED and PR_STOPPED flags are renamed to PR_FL_DISABLED and
PR_FL_STOPPED respectively. In addition, everywhere there is a test to know
if a proxy is disabled or stopped, there is now a bitwise AND operation on
PR_FL_DISABLED and/or PR_FL_STOPPED flags.
.disabled field in the proxy structure is documented to be a bitfield. So
use it as a bitfield. This change was introduced to the 2.5, by commit
8e765b86f ("MINOR: proxy: disabled takes a stopping and a disabled state").
No backport is needed except if the above commit is backported.
The test on the return value of fix_tag_value() function was inverted. To
wait for more data, the return value must be a valid empty string and not
IST_NULL.
This patch must be backported to 2.4.
http-after-response rules evaluation must be stopped after a "allow". It
means the frontend ruleset must not be evaluated if a "allow" was performed
in the backend ruleset. Internally, the evaluation must be stopped if on
HTTP_RULE_RES_STOP return value. Only the "allow" action is concerned by
this change.
Thanks to this patch, http-response and http-after-response behave in the
same way.
This patch should be backported as far as 2.2.
This is the same as for commit 468c000db ("BUG/MEDIUM: jwt: fix base64
decoding error detection"), but for function sample_conv_jwt_member_query()
that is used by sample converters jwt_header_query() and jwt_payload_query().
Thanks to Tim for the report. No backport is needed.
As Tim reported in github issue #1414, we ought to use a constant-time
memcmp() when comparing hashes to avoid time-based attacks. Let's use
CRYPTO_memcmp() since this code already depends on openssl.
No backport is needed, this was just merged into 2.5.
Tim reported that a decoding error from the base64 function wouldn't
be matched in case of bad input, and could possibly cause trouble
with -1 being passed in decoded_sig->data. In the case of HMAC+SHA
it is harmless as the comparison is made using memcmp() after checking
for length equality, but in the case of RSA/ECDSA this result is passed
as a size_t to EVP_DigetVerifyFinal() and may depend on the lib's mood.
The fix simply consists in checking the intermediary result before
storing it.
That's precisely what happens with one of the regtests which returned
0 instead of 4 on the intentionally defective token, so the regtest
was fixed as well.
No backport is needed as this is new in this release.
A bug was introduced by commit previous bf9498a31 ("MINOR: resolvers:
fix the resolv_str_to_dn_label() API about trailing zero") as the code
is particularly contrived and hard to test. The output writes the last
char at [i+1] so the trailing zero and return value must be at i+1.
This will have to be backported where the patch above is backported
since it was needed for a fix.
These two fields are exclusive as they depend on the data type.
Let's move them into a union to save some precious bytes. This
reduces the struct resolv_answer_item size from 600 to 576 bytes.
The struct resolv_answer_item contains an address field of type
"sockaddr" which is only 16 bytes long, but which is used to store
either IPv4 or IPv6. Fortunately, the contents only overlap with
the "target" field that follows it and that is large enough to
absorb the extra bytes needed to store AAAA records. But this is
dangerous as just moving fields around could result in memory
corruption.
The fix uses a union and removes the casts that were used to hide
the problem.
Older versions need to be checked and possibly fixed. This needs
to be backported anyway.
In multi-threaded mode, on operating systems supporting multiple listeners on
the same IP:port, this will automatically create this number of multiple
identical listeners for the same line, all bound to a fair share of the number
of the threads attached to this listener. This can sometimes be useful when
using very large thread counts where the in-kernel locking on a single socket
starts to cause a significant overhead. In this case the incoming traffic is
distributed over multiple sockets and the contention is reduced. Note that
doing this can easily increase the CPU usage by making more threads work a
little bit.
If the number of shards is higher than the number of available threads, it
will automatically be trimmed to the number of threads. A special value
"by-thread" will automatically assign one shard per thread.
This function's purpose will be to duplicate a listener in INIT state.
This will be used to ease declaration of listeners spanning multiple
groups, which will thus require multiple FDs hence multiple receivers.
With groups at some point we'll have to have distinct masks/groups in the
receiver and the bind_conf, because a single bind_conf might require to
instantiate multiple receivers (one per group).
Let's split the thread mask and group to have one for the bind_conf and
another one for the receiver while it remains easy to do. This will later
allow to use different storage for the bind_conf if needed (e.g. support
multiple groups).
This function suffers from the same API issue as its sibling that does the
opposite direction, it demands that the input string is zero-terminated
*and* that its length *including* the trailing zero is passed on input,
forcing callers to pass length + 1, and itself to use that length - 1
everywhere internally.
This patch addressess this. There is a single caller, which is the
location of the previous bug, so it should probably be backported at
least to keep the code consistent across versions. Note that the
function is called dns_dn_label_to_str() in 2.3 and earlier.
An off-by-one issue in buffer size calculation used to limit the output
of resolv_dn_label_to_str() to 254 instead of 255.
This must be backported to 2.0.
In issue #1411, @jjiang-stripe reports that do-resolve() sometimes seems
to be trying to resolve crap from random memory contents.
The issue is that action_prepare_for_resolution() tries to measure the
input string by itself using strlen(), while resolv_action_do_resolve()
directly passes it a pointer to the sample, omitting the known length.
Thus of course any other header present after the host in memory are
appended to the host value. It could theoretically crash if really
unlucky, with a buffer that does not contain any zero including in the
index at the end, and if the HTX buffer ends on an allocation boundary.
In practice it should be too low a probability to have ever been observed.
This patch modifies the action_prepare_for_resolution() function to take
the string length on with the host name on input and pass that down the
chain. This should be backported to 2.0 along with commit "MINOR:
resolvers: fix the resolv_str_to_dn_label() API about trailing zero".
This function is bogus at the API level: it demands that the input string
is zero-terminated *and* that its length *including* the trailing zero is
passed on input. While that already looks smelly, the trailing zero is
copied as-is, and is then explicitly replaced with a zero... Not only
all callers have to pass hostname_len+1 everywhere to work around this
absurdity, but this requirement causes a bug in the do-resolve() action
that passes random string lengths on input, and that will be fixed on a
subsequent patch.
Let's fix this API issue for now.
This patch will have to be backported, and in versions 2.3 and older,
the function is in dns.c and is called dns_str_to_dn_label().
Some protocols fail with "error blah [ip:port]" and other fail with
"[ip:port] error blah". All this already appears in a "starting" or
"binding" context after a proxy name. Let's choose a more universal
approach like below where the ip:port remains at the end of the line
prefixed with "for".
[WARNING] (18632) : Binding [binderr.cfg:10] for proxy http: cannot bind receiver to device 'eth2' (No such device) for [0.0.0.0:1080]
[WARNING] (18632) : Starting [binderr.cfg:10] for proxy http: cannot set MSS to 12 for [0.0.0.0:1080]
Binding errors and late socket errors provide no information about
the file and line where the problem occurs. These are all done by
protocol_bind_all() and they only report "Starting proxy blah". Let's
change this a little bit so that:
- the file name and line number of the faulty bind line is alwas mentioned
- early binding errors are indicated with "Binding" instead of "Starting".
Now we can for example have this:
[WARNING] (18580) : Binding [binderr.cfg:10] for proxy http: cannot bind receiver to device 'eth2' (No such device) [0.0.0.0:1080]
The MSS errors are the only ones not indicating what was attempted, let's
report the value that was tried, as it can help users spot them in the
config (particularly if a default value was used).
Right now only the last warning or error is reported from
tcp_bind_listener(), but it is useful to report all warnings and no only
the last one, so we now emit them delimited by commas. Previously we used
a fixed buffer of 100 bytes, which was too small to store more than one
message, so let's extend it.
Signed-off-by: Bjoern Jacke <bjacke@samba.org>
This new converter takes a JSON Web Token, an algorithm (among the ones
specified for JWS tokens in RFC 7518) and a public key or a secret, and
it returns a verdict about the signature contained in the token. It does
not simply return a boolean because some specific error cases cas be
specified by returning an integer instead, such as unmanaged algorithms
or invalid tokens. This enables to distinguich malformed tokens from
tampered ones, that would be valid format-wise but would have a bad
signature.
This converter does not perform a full JWT validation as decribed in
section 7.2 of RFC 7519. For instance it does not ensure that the header
and payload parts of the token are completely valid JSON objects because
it would need a complete JSON parser. It only focuses on the signature
and checks that it matches the token's contents.
Those converters allow to extract a JSON value out of a JSON Web Token's
header part or payload part (the two first dot-separated base64url
encoded parts of a JWS in the Compact Serialization format).
They act as a json_query call on the corresponding decoded subpart when
given parameters, and they return the decoded JSON subpart when no
parameter is given.
A JWT signed with the RSXXX or ESXXX algorithm (RSA or ECDSA) requires a
public certificate to be verified and to ensure it is valid. Those
certificates must not be read on disk at runtime so we need a caching
mechanism into which those certificates will be loaded during init.
This is done through a dedicated ebtree that is filled during
configuration parsing. The path to the public certificates will need to
be explicitely mentioned in the configuration so that certificates can
be loaded as early as possible.
This tree is different from the ckch one because ckch entries are much
bigger than the public certificates used in JWT validation process.
This helper function splits a JWT under Compact Serialization format
(dot-separated base64-url encoded strings) into its different sub
strings. Since we do not want to manage more than JWS for now, which can
only have at most three subparts, any JWT that has strictly more than
two dots is considered invalid.
The full list of possible algorithms used to create a JWS signature is
defined in section 3.1 of RFC7518. This patch adds a helper function
that converts the "alg" strings into an enum member.
This fetch can be used to retrieve the data contained in an HTTP
Authorization header when the Bearer scheme is used. This is used when
transmitting JSON Web Tokens for instance.
On receiving CONNECTION_CLOSE frame, the mux is flagged for immediate
connection close. A stream is closed even if there is data not ACKed
left if CONNECTION_CLOSE has been received.
The mux tx buffers have been rewritten with buffers attached to qcs
instances. qc_buf_available and qc_get_buf functions are updated to
manipulates qcs. All occurences of the unused qcc ring buffer are
removed to ease the code maintenance.
Defer the shutting of a qcs if there is still data in its tx buffers. In
this case, the conn_stream is closed but the qcs is kept with a new flag
QC_SF_DETACH.
On ACK reception, the xprt wake up the shut_tl tasklet if the stream is
flagged with QC_SF_DETACH. This tasklet is responsible to free the qcs
and possibly the qcc when all bidirectional streams are removed.
For the moment, a quic connection is considered dead if it has no
bidirectional streams left on it. This test is implemented via
qcc_is_dead function. It can be reused to properly close the connection
when needed.
Properly handle tx buffers management in h3 data sending. If there is
not enough contiguous space, the buffer is first realigned. If this is
not enough, the stream is flagged with QC_SF_BLK_MROOM waiting for the
buffer to be emptied.
If a frame on a stream is successfully pushed for sending, the stream is
called if it was flagged with QC_SF_BLK_MROOM.
Remove the tx mux ring buffers in qcs, which should be in the qcc. For
the moment, use a simple architecture with 2 simple tx buffers in the
qcs.
The first buffer is used by the h3 layer to prepare the data. The mux
send operation transfer these into the 2nd buffer named xprt_buf. This
buffer is only freed when an ACK has been received.
This architecture is functional but not optimal for two reasons :
- it won't limit the buffer usage by connection
- each transfer on a new stream requires an allocation
This new ssllib_name_startswith precondition check can be used to
distinguish application linked with OpenSSL from the ones linked with
other SSL libraries (LibreSSL or BoringSSL namely). This check takes a
string as input and returns 1 when the SSL library's name starts with
the given string. It is based on the OpenSSL_version function which
returns the same output as the "openssl version" command.
Set an `lua_atpanic()` handler before calling `hlua_prepend_path()` in
`hlua_config_prepend_path()`.
This prevents the process from abort()ing when `hlua_prepend_path()` fails
for some reason.
see GitHub Issue #1409
This is a very minor issue that can't happen in practice. No backport needed.
This line is not related to the response channel but to the stream. Thus it
must be indented at the same level as stream-interfaces, connections,
channels...
Filters can block the stream on pre/post analysis for any reason and it can
be useful to report it in "show sess all". So now, a "current_filter" extra
line is reported for each channel if a filter is blocking the analysis. Note
that this does not catch the TCP/HTTP payload analysis because all
registered filters are always evaluated when more data are received.
Sometimes an HTTP or TCP rule may take time to complete because it is
waiting for external data (e.g. "wait-for-body", "do-resolve"), and it
can be useful to report the action and the location of that rule in
"show sess all". Here for streams blocked on such a rule, there will
now be a "current_line" extra line reporting this. Note that this does
not catch rulesets which are re-evaluated from the start on each change
(e.g. tcp-request content waiting for changes) but only when a specific
rule is being paused.
These ones are passed on rule creation for the sole purpose of being
reported in "show sess", which is not done yet. For now the entries
are allocated upon rule creation and freed in free_act_rules().
Rules are currently allocated using calloc() by their caller, which does
not make it very convenient to pass more information such as the file
name and line number.
This patch introduces new_act_rule() which performs the malloc() and
already takes in argument the ruleset (ACT_F_*), the file name and the
line number. This saves the caller from having to assing ->from, and
will allow to improve the internal storage with more info.
There have been a large number of issues reported with conn_cur
synchronization because the concept is wrong. In an active-passive
setup, pushing the local connections count from the active node to
the passive one will result in the passive node to have a higher
counter than the real number of connections. Due to this, after a
switchover, it will never be able to close enough connections to
go down to zero. The same commonly happens on reloads since the new
process preloads its values from the old process, and if no connection
happens for a key after the value is learned, it is impossible to reset
the previous ones. In active-active setups it's a bit different, as the
number of connections reflects the number on the peer that pushed last.
This patch solves this by marking the "conn_cur" local and preventing
it from being learned from peers. It is still pushed, however, so that
any monitoring system that collects values from the peers will still
see it.
The patch is tiny and trivially backportable. While a change of behavior
in stable branches is never welcome, it remains possible to fix issues
if reports become frequent.
In the configuration sometimes we'll omit a thread group number to designate
a global thread number range, and sometimes we'll mention the group and
designate IDs within that group. The operation is more complex than it
seems due to the need to check for ranges spanning between multiple groups
and determining groups from threads from bit masks and remapping bit masks
between local/global.
This patch adds a function to perform this operation, it takes a group and
mask on input and updates them on output. It's designed to be used by "bind"
lines but will likely be usable at other places if needed.
For situations where specified threads do not exist in the group, we have
the choice in the code between silently fixing the thread set or failing
with a message. For now the better option seems to return an error, but if
it turns out to be an issue we can easily change that in the future. Note
that it should only happen with "x/even" when group x only has one thread.
This extends the "thread" statement of bind lines to support an optional
thread group number. When unspecified (0) it's an absolute thread range,
and when specified it's one relative to the thread group. Masks are still
used so no more than 64 threads may be specified at once, and a single
group is possible. The directive is not used for now.
Now thread dumps will report the thread group number and the ID within
this group. Note that this is still quite limited because some masks
are calculated based on the thread in argument while they have to be
performed against a group-level thread ID.
This is the equivalent of "tid" for ease of access. In the future if we
make th_cfg a pure thread-local array (not a pointer), it may make sense
to move it there.
ha_set_tid() was randomly used either to explicitly set thread 0 or to
set any possibly incomplete thread during boot. Let's replace it with
a pointer to a valid thread or NULL for any thread. This allows us to
check that the designated threads are always valid, and to ignore the
thread 0's mapping when setting it to NULL, and always use group 0 with
it during boot.
The initialization code is also cleaner, as we don't pass ugly casts
of a thread ID to a pointer anymore.
This will be a convenient way to communicate the thread ID and its
local ID in the group, as well as their respective bits when creating
the threads or when only a pointer is given.
This will ease the reporting of the current thread group ID when coming
from the thread itself, especially since it returns the visible ID,
starting at 1.
This takes care of unassigned threads groups and places unassigned
threads there, in a more or less balanced way. Too sparse allocations
may still fail though. For now with a maximum group number fixed to 1
nothing can really fail.
This registers a mapping of threads to groups by enumerating for each thread
what group it belongs to, and marking the group as assigned. It takes care of
checking for redefinitions, overlaps, and holes. It supports both individual
numbers and ranges. The thread group is referenced from the thread config.
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.
For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).
The goal will be to later move there some global items that must be
made per-group.
We want to make sure that the current thread_info accessed via "ti" will
remain constant, so that we don't accidentally place new variable parts
there and so that the compiler knows that info retrieved from there is
not expected to have changed between two function calls.
Only a few init locations had to be adjusted to use the array and the
rest is unaffected.
The last 3 fields were 3 list heads that are per-thread, and which are:
- the pool's LRU head
- the buffer_wq
- the streams list head
Moving them into thread_ctx completes the removal of dynamic elements
from the struct thread_info. Now all these dynamic elements are packed
together at a single place for a thread.
The TI_FL_STUCK flag is manipulated by the watchdog and scheduler
and describes the apparent life/death of a thread so it changes
all the time and it makes sense to move it to the thread's context
for an active thread.
The "thread_info" name was initially chosen to store all info about
threads but since we now have a separate per-thread context, there is
no point keeping some of its elements in the thread_info struct.
As such, this patch moves prev_cpu_time, prev_mono_time and idle_pct to
thread_ctx, into the thread context, with the scheduler parts. Instead
of accessing them via "ti->" we now access them via "th_ctx->", which
makes more sense as they're totally dynamic, and will be required for
future evolutions. There's no room problem for now, the structure still
has 84 bytes available at the end.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
was moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
We used to remap SI_TKILL to SI_LWP when SI_TKILL was not available
(e.g. FreeBSD) but that's ugly and since we need this only in a single
switch/case block in wdt.c it's even simpler and cleaner to perform the
two tests there, so let's do this.
The watchdog timer had no more reason for being shared with the struct
thread_info since the watchdog is the only user now. Let's remove it
from the struct and move it to a static array in wdt.c. This removes
some ifdefs and the need for the ugly mapping to empty_t that might be
subject to a cast to a long when compared to TIMER_INVALID. Now timer_t
is not known outside of wdt.c and clock.c anymore.
This removes the knowledge of clockid_t from anywhere but clock.c, thus
eliminating a source of includes burden. The unused clock_id field was
removed from thread_info, and the definition setting of clockid_t was
removed from compat.h. The most visible change is that the function
now_cpu_time_thread() now takes the thread number instead of a tinfo
pointer.
The code that deals with timer creation for the WDT was moved to clock.c
and is called with the few relevant arguments. This removes the need for
awareness of clock_id from wdt.c and as such saves us from having to
share it outside. The timer_t is also known only from both ends but not
from the public API so that we don't have to create a fake timer_t
anymore on systems which do not support it (e.g. macos).
This was previously open-coded in run_thread_poll_loop(). Now that
we have clock.c dedicated to such stuff, let's move the code there
so that we don't need to keep such ifdefs nor to depend on the
clock_id.
Instead of fiddling with before_poll and after_poll in
activity_count_runtime(), the function is now called by
clock_entering_poll() which passes it the number of microseconds
spent working. This allows to remove all calls to
activity_count_runtime() from the pollers.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
It was brought by a variable declared after some statements in commit
21185970c ("MINOR: proc: setting the process to produce a core dump on
FreeBSD."). It's worth noting that some versions of clang seem to ignore
-Wdeclaration-after-statement by default. No backport is needed.
Remove unused code in mux-quic. This is mostly code related to the
backend side. This code is untested for the moment, its removal will
simplify the code maintenance.
Remove an unneeded strdup invocation during QPACK huffman decoding. A
temporary storage buffer is passed by the function and exists after
decoding so no need to duplicate memory here.
We've found others places where the read0 is ignored because of an
incomplete frame parsing. This time, it happens during parsing of
CONTINUATION frames.
When frames are parsed, incomplete frames are properly handled and
H2_CF_DEM_SHORT_READ flag is set. It is also true for HEADERS
frames. However, for CONTINUATION frames, there is an exception. Besides
parsing the current frame, we try to peek header of the next one to merge
payload of both frames, the current one and the next one. Idea is to create
a sole HEADERS frame before parsing the payload. However, in this case, it
is possible to have an incomplete frame too, not the current one but the
next one. From the demux point of view, the current frame is complete. We
must go to the internal function h2c_decode_headers() to detect an
incomplete frame. And this case was not identified and fixed when
H2_CF_DEM_SHORT_READ flag was introduced in the commit b5f7b5296
("BUG/MEDIUM: mux-h2: Handle remaining read0 cases on partial frames")
This bug was reported in a comment of the issue #1362. The patch must be
backported as far as 2.0.
Remove the quic_conn from the receiver connection_ids tree on
quic_conn_free. This fixes a crash due to dangling references in the
tree after a quic connection release.
This operation must be conducted under the listener lock. For this
reason, the quic_conn now contains a reference to its attached listener.
Use the count of bidirectional streams to call qc_release in qc_detach.
We cannot inspect the by_id tree because uni-streams are never removed
from it. This allows the connection to be properly freed.
It is required that all qcs streams are in the by_id tree for the xprt
to function correctly. Without this, some ACKs are not properly emitted
by xprt.
Note that this change breaks the free of the connection because the
condition eb_is_empty in qc_detach is always true. This will be fixed in
a following patch.
It seems it was a bad idea to use the same function as for TCP ssl sockets
to initialize the SSL session objects for QUIC with ssl_bio_and_sess_init().
Indeed, this had as very bad side effects to generate SSL errors due
to the fact that such BIOs initialized for QUIC could not finally be controlled
via the BIO_ctrl*() API, especially BIO_ctrl() function used by very much other
internal OpenSSL functions (BIO_push(), BIO_pop() etc).
Others OpenSSL base QUIC implementation do not use at all BIOs to configure
QUIC connections. So, we decided to proceed the same way as ngtcp2 for instance:
only initialize an SSL object and call SSL_set_quic_method() to set its
underlying method. Note that calling this function silently disable this option:
SSL_OP_ENABLE_MIDDLEBOX_COMPAT.
We implement qc_ssl_sess_init() to initialize SSL sessions for QUIC connections
to do so with a retry in case of allocation failure as this is done by
ssl_bio_and_sess_init(). We also modify the code part for haproxy servers.
The "show pools" command provides some "allocated" and "used" estimates
on the pools objects, but this applies to the shared pool and the "used"
includes what is currently assigned to thread-local caches. It's possible
to know how much each thread uses, so let's dump the total size allocated
by thread caches as an estimate. It's only done when pools are enabled,
which explains why the patch adds quite a lot of ifdefs.
These ones are rarely used or only to waste CPU cycles waiting, and are
the last ones requiring system includes in thread.h. Let's uninline them
and move them to thread.c.
This removes the thread identifiers from struct thread_info and moves
them only in static array in thread.c since it's now the only file that
needs to touch it. It's also the only file that needs to include
pthread.h, beyond haproxy.c which needs it to start the poll loop. As
a result, much less system includes are needed and the LoC reduced by
around 3%.
haproxy.c still has to deal with pthread-specific low-level stuff that
is OS-dependent. We should not have to deal with this there, and we do
not need to access pthread anywhere else.
Let's move these 3 functions to thread.c and keep empty inline ones for
when threads are disabled.
It's not needed to inline it at all (one call per loop) and it introduces
dependencies, let's move it to fd.c.
Removing the few remaining includes that came with it further reduced
by ~0.2% the LoC and the build time is now below 6s.
It's pointless to inline this, it's called exactly once per poll loop,
and it depends on time.h which is quite deep. Let's move that to task.c
along with sched_report_idle().
The remaining large functions are those allocating/initializing and
occasionally freeing connections, conn_streams and sockaddr. Let's
move them to connection.c. In fact, cs_free() is the only one-liner
but let's move it along with the other ones since a call will be
small compared to the rest of the work done there.
The following inlined functions are particularly large (and probably not
inlined at all by the compiler), and together represent roughly half of
the file, while they're used at most once per connection. They were moved
to connection.c.
conn_upgrade_mux_fe, conn_install_mux_fe, conn_install_mux_be,
conn_install_mux_chk, conn_delete_from_tree, conn_init, conn_new,
conn_free
The following functions are quite heavy and have no reason to be kept
inlined:
srv_release_conn, srv_lookup_conn, srv_lookup_conn_next,
srv_add_to_idle_list
They were moved to server.c. It's worth noting that they're a bit
at the edge between server and connection and that maybe we could
create an idle-conn file for these in the near future.
We do not really need to have them inlined, and having xxhash.h included
by connection.h results in this 4700-lines file being processed 101 times
over the whole project, which accounts for 13.5% of the total size!
Additionally, half of the functions are only needed from connection.c.
Let's move the functions there and get rid of the painful include.
The build time is now down to 6.2s just due to this.
The hash type stored everywhere is XXH64_hash_t, which annoyingly forces
everyone to include the huge xxhash file. We know it's an uint64_t because
that's its purpose and the type is only made to abstract it on machines
where uint64_t is not availble. Let's switch the type to uint64_t
everywhere and avoid including xxhash from the type file.
This one is expensive in code size because it comes with xxhash.h at a
low level of dependency that's inherited at plenty of places, and for
a function does doesn't benefit from inlining and could possibly even
benefit from not being inline given that it's large and called from the
scheduler.
Moving it to activity.c reduces the LoC by 1.2% and the binary size by
~1kB.
This function has no reason for being inlined, it's called from non
critical places (once in pollers), is quite large and comes with
dependencies (time and freq_ctr). Let's move it to acitvity.c. That's
another 0.4% less LoC to build.
The idle time calculation stuff was moved to task.h by commit 6dfab112e
("REORG: sched: move idle time calculation from time.h to task.h") but
these two variables that are only maintained by task.{c,h} were still
left in time.{c,h}. They have to move as well.
These ones require openssl and are only built when it's enabled. There's
no point keeping them in sample.c when ssl_sample.c already deals with this
and the required includes. This also allows to remove openssl-compat.h
from sample.c and to further reduce the number of inclusions of openssl
includes, and the build time is now down to under 8 seconds.
These two counters were the only ones not in the global struct, while
the SSL freq counters or the req counts are already in it, this forces
stats.c to include ssl_sock just to know about them. Let's move them
over there with their friends. This reduces from 408 to 384 the number
of includes of opensslconf.h.
This one has nothing to do with ssl_sock as it manipulates the struct
server only. Let's move it to server.c and remove unneeded dependencies
on ssl_sock.h. This further reduces by 10% the number of includes of
opensslconf.h and by 0.5% the number of compiled lines.
This one doesn't use anything from an SSL context, it only checks the
type of the transport layer of a connection, thus it belongs to
connection.h. This is particularly visible due to all the ifdefs
around it in various call places.
These functions have no reason for being inlined, and they require some
includes with long dependencies. Let's move them to listener.c and trim
unused includes in listener.h.
The lock-debugging code in thread.h has no reason to be inlined. the
functions are quite fat and perform a lot of operations so there's no
saving keeping them inlined. Worse, most of them are in fact not
inlined, resulting in a significantly bigger executable.
This patch moves all this part from thread.h to thread.c. The functions
are still exported in thread.h of course. This results in ~166kB less
code:
text data bss dec hex filename
3165938 99424 897376 4162738 3f84b2 haproxy-before
2991987 99424 897376 3988787 3cdd33 haproxy-after
In addition the build time with thread debugging enabled has shrunk
from 19.2 to 17.7s thanks to much less code to be parsed in thread.h
that is included virtually everywhere.
pool-os.h relies on a number of includes solely because the
pool_alloc_area() function was inlined, and this only because we want
the normal version to be inlined so that we can track the calling
places for the memory profiler. It's worth noting that it already
does not work at -O0, and that when UAF is enabled we don't care a
dime about profiling.
This patch does two things at once:
- force-inline the functions so that pool_alloc_area() is still
inlined at -O0 to help track malloc() users ;
- uninline the UAF version of these (that rely on mmap/munmap)
and move them to pools.c so that we can remove all unneeded
includes.
Doing so reduces by ~270kB or 0.15% the total build size.
A number of files currently access activity counters but rely on their
definitions to be inherited from other files (task.c, backend.c hlua.c,
sock.c, pool.c, stats.c, fd.c).
backend.c, all muxes, backend.c started manipulating ebmb_nodes with
the introduction of idle conns but the types were inherited through
other includes. Let's add ebmbtree.h there.
The various variable-to-sample converters allow to turn a variable to
a sample of type string, sint or binary, but both the string one used
by strcmp() and the binary one used by secure_memcmp() are missing a
pointer check on the ability to the cast, making them crash if a
variable of type addr is used with strcmp(), or if an addr or bool is
used with secure_memcmp().
Let's rely on the new sample_conv_var2smp() function to run the proper
checks.
This will need to be backported to all supported version. It relies on
previous commits:
CLEANUP: server: always include the storage for SSL settings
CLEANUP: sample: rename sample_conv_var2smp() to *_sint
CLEANUP: sample: uninline sample_conv_var2smp_str()
MINOR: sample: provide a generic var-to-sample conversion function
For backports it's probably easier to check the sample_casts[] pointer
before calling it in sample_conv_strcmp() and sample_conv_secure_memcmp().
We're using variable-to-sample conversion at least 4 times in the code,
two of which are bogus. Let's introduce a generic conversion function
that performs the required checks.
This one only handles integers, contrary to its sibling with the suffix
_str that only handles strings. Let's rename it and uninline it since it
may well be used from outside.
The SSL stuff in struct server takes less than 3% of it and requires
lots of annoying ifdefs in the code just to take care of the cases
where the field is absent. Let's get rid of this and stop including
openssl-compat from server.c to detect NPN and ALPN capabilities.
This reduces the total LoC by another 0.4%.
Migrate the httpclient:get() method to named arguments so we can
specify optional arguments.
This allows to pass headers as an optional argument as an array.
The () in the method call must be replaced by {}:
local res = httpclient:get{url="http://127.0.0.1:9000/?s=99",
headers= {["X-foo"] = { "salt" }, ["X-bar"] = {"pepper" }}}
During httpclient_destroy, add a condition in the BUG_ON which checks
that the client was started before it has ended. A httpclient structure
could have been created without being started.
When using the lua httpclient, haproxy could crash because a b_xfer is
done in httpclient_xfer, which will do a zero-copy swap of the data in
the buffers. The ptr will then be free() by the pool.
However this can't work with a trash buffer, because the area was not
allocated from the pool buffer, so the pool is not suppose to free it
because it does not know this ptr, using -DDEBUG_MEMORY_POOLS will
result with a crash during the free.
Fix the problem by using b_force_xfer() instead of b_xfer which copy
the data instead. The problem still exist with the trash however, and
the trash API must be reworked.
Implement the garbage collector of the lua httpclient.
This patch declares the __gc method of the httpclient object which only
does a httpclient_stop_and_destroy().
httpclient_stop_and_destroy() tries to destroy the httpclient structure
if the client was stopped.
In the case the client wasn't stopped, it ask the client to stop itself
and to destroy the httpclient structure itself during the release of the
applet.
httpclient_destroy() must free all the ist in the httpclient structure,
the URL in the request, the vsn and reason in the response.
It also must free the list of headers of the response.
A bug was introduced by the commit 2d5650082 ("BUG/MEDIUM: http-ana: Reset
channels analysers when returning an error").
The request analyzers must be cleared when a redirect rule is applied. It is
not a problem if the redirect rule is inside an http-request ruleset because
the analyzer takes care to clear it. However, when it comes from a redirect
ruleset (via the "redirect ..." directive), because of the above commit,
the request analyzers are no longer cleared. It means some HTTP request
analyzers may be called while the request channel was already flushed. It is
totally unexpected and may lead to crash.
Thanks to Yves Lafon for reporting the problem.
This patch must be backported everywhere the above commit was backported.
When a filter is attached to a stream, the wrong FLT_END analyzer is added
on the request channel. AN_REQ_FLT_END must be added instead of
AN_RES_FLT_END. Because of this bug, the stream may hang on the filter
release stage.
It seems to be ok for HTTP filters (cache & compression) in HTTP mode. But
when enabled on a TCP proxy, the stream is blocked until the client or the
server timeout expire because data forwarding is blocked. The stream is then
prematurely aborted.
This bug was introduced by commit 26eb5ea35 ("BUG/MINOR: filters: Always set
FLT_END analyser when CF_FLT_ANALYZE flag is set"). The patch must be
backported in all stable versions.
time.h is a horrible place to put activity calculation, it's a
historical mistake because the functions were there. We already have
most of the parts in sched.{c,h} and these ones make an exception in
the middle, forcing time.h to include some thread stuff and to access
the before/after_poll and idle_pct values.
Let's move these 3 functions to task.h with the other ones. They were
prefixed with "sched_" instead of the historical "tv_" which already
made no sense anymore.
I don't know why I inlined this one, this makes no sense given that it's
only used for stats, and it starts a circular dependency on tinfo.h which
can be problematic in the future. In addition, all the stuff related to
idle time calculation should be with the rest of the scheduler, which
currently is in task.{c,h}, so let's move it there.
We'll need to improve the API to pass other arguments in the future, so
let's start to adapt better to the current use cases. task_new() is used:
- 18 times as task_new(tid_bit)
- 18 times as task_new(MAX_THREADS_MASK)
- 2 times with a single bit (in a loop)
- 1 in the debug code that uses a mask
This patch provides 3 new functions to achieve this:
- task_new_here() to create a task on the calling thread
- task_new_anywhere() to create a task to be run anywhere
- task_new_on() to create a task to run on a specific thread
The change is trivial and will allow us to later concentrate the
required adaptations to these 3 functions only. It's still possible
to call task_new() if needed but a comment was added to encourage the
use of the new ones instead. The debug code was not changed and still
uses it.
Work lists were a mechanism introduced in 1.8 to asynchronously delegate
some work to be performed on another thread via a dedicated task.
The only user was the listeners, to deal with the queue. Nowadays
the tasklets have made this much more convenient, and have replaced
work_lists in the listeners. It seems there will be no valid use case
of work lists anymore, so better get rid of them entirely and keep the
scheduler code cleaner.
__task_queue() must absolutely not be called with TICK_ETERNITY or it
will place a never-expiring node upfront in the timers queue, preventing
any timer from expiring until the process is restarted. Code was found
to cause this using "task_schedule(task, now_ms)" which does this one
millisecond every 49.7 days, so let's add a condition against this. It
must never trigger since any process susceptible to trigger it would
already accumulate tasks until it dies.
An extra test was added in wake_expired_tasks() to detect tasks whose
timeout would have been changed after being queued.
An improvement over this could be in the future to use a non-scalar
type (union/struct) for expiration dates so as to avoid the risk of
using them directly like this. But now_ms is already such a valid
time and this specific construct would still not be caught.
This could even be backported to stable versions to help detect other
occurrences if any.
For now, tcp-request and tcp-response content rules evaluation is
interrupted before the inspect-delay when the channel's buffer is full, the
RX path is blocked or when a shutdown for reads was received. To sum up, the
evaluation is interrupted when no more input data are expected. However, it
is not exhaustive. It also happens when end of input is reached (CF_EOI flag
set) or when a read error occurred (CF_READ_ERROR flag set).
Note that, AFAIK, it is only a problem on HAProy 2.3 and prior when a H1 to
H2 upgrade is performed. On newer versions, it works as expected because the
stream is not created at this stage.
This patch must be backported as far as 2.0.
During tcp/http check rules parsing, when a sample fetch or a log-format
string is parsed, the proxy's argument list used to track unresolved
argument is no longer passed for default proxies. It means it is no longer
possible to rely on sample fetches depending on the execution context (for
instance 'nbsrv').
It is important to avoid HAProxy crashes because these arguments are
resolved during the configuration validity check. But, default proxies are
not evaluated during this stage. Thus, these arguments remain unresolved.
It will probably be possible to relax this rule. But to ease backports, it
is forbidden for now.
This patch must be backported as far as 2.2. It depends on the commit
"MINOR: arg: Be able to forbid unresolved args when building an argument
list". It must be adapted for the 2.3 because PR_CAP_DEF capability was
introduced in the 2.4. A solution may be to test The proxy's id agains NULL.
In make_arg_list() function, unresolved dependencies are pushed in an
argument list to be resolved later, during the configuration validity
check. It is now possible to forbid such unresolved dependencies by omitting
<al> parameter (setting it to NULL). It is usefull when the parsing context
is not the same than the running context or when the parsing context is lost
after the startup stage. For instance, an argument may be defined in
defaults section during parsing and executed in a frontend/backend section.
The Lua tasks registered vi core.register_task() use a dangerous
task_schedule(task, now_ms) to start them, that will most of the
time work by accident, except when the time wraps every 49.7 days,
if now_ms is 0, because it's not valid to queue a task with an
expiration date set to TICK_ETERNITY, as it will fail all wakeup
checks and prevent all subsequent timers from being seen as expired.
The only solution in this case is to restart the process.
Fortunately for the vast majority of users it is extremely unlikely
to ever be met (only one millisecond every 49.7 days is at risk), but
this can be systematic for a process dealing with 1000 req/s, hence
the major tag.
The bug was introduced in 1.6-dev with commit 24f335340 ("MEDIUM: lua:
add coroutine as tasks."), so the fix must be backported to all stable
branches.
A time comparison was wrong in hlua_sleep_yield(), making the sleep()
code do nothing for periods of 24 days every 49 days. An arithmetic
comparison was performed on now_ms instead of using tick_is_expired().
This bug was added in 1.6-dev by commit 5b8608f1e ("MINOR: lua: core:
add sleep functions") so the fix should be backported to all stable
versions.
In case of error while calling a SSL_read or SSL_write, the
SSL_get_error function is called in order to know more about the error
that happened. If the error code is SSL_ERROR_SSL or SSL_ERROR_SYSCALL,
the error queue might contain more information on the error. This error
code was not used until now. But we now need to store it in order for
backend error fetches to catch all handshake related errors.
The change was required because the previous backend fetch would not
have raised anything if the client's certificate was rejected by the
server (and the connection interrupted). This happens because starting
from TLS1.3, the 'Finished' state on the client is reached before its
certificate is sent to the server (see the "Protocol Overview" part of
RFC 8446). The only place where we can detect that the server rejected the
certificate is after the first SSL_read call after the SSL_do_handshake
function.
This patch then adds an extra ERR_peek_error after the SSL_read and
SSL_write calls in ssl_sock_to_buf and ssl_sock_from_buf. This means
that it could set an error code in the SSL context a long time after the
handshake is over, hence the change in the error fetches.
The ssl_bc_hsk_err sample fetch will need to raise more errors than only
handshake related ones hence its renaming to a more generic ssl_bc_err.
This patch is required because some handshake failures that should have
been caught by this fetch (verify error on the server side for instance)
were missed. This is caused by a change in TLS1.3 in which the
'Finished' state on the client is reached before its certificate is sent
(and verified) on the server side (see the "Protocol Overview" part of
RFC 8446).
This means that the SSL_do_handshake call is finished long before the
server can verify and potentially reject the client certificate.
The ssl_bc_hsk_err will then need to be expanded to catch other types of
errors.
This change is also applied to the frontend fetches (ssl_fc_hsk_err
becomes ssl_fc_err) and to their string counterparts.
In case of a connection error happening after the SSL handshake is
completed, the error code stored in the connection structure would not
always be set, hence having some connection failures being described as
successful in the fc_conn_err or bc_conn_err sample fetches.
The most common case in which it could happen is when the SSL server
rejects the client's certificate. The SSL_do_handshake call on the
client side would be sucessful because the client effectively sent its
client hello and certificate information to the server, but the next
call to SSL_read on the client side would raise an SSL_ERROR_SSL code
(through the SSL_get_error function) which is decribed in OpenSSL
documentation as a non-recoverable and fatal SSL error.
This patch ensures that in such a case, the connection's error code is
set to a special CO_ERR_SSL_FATAL value.
HAproxy only handles "chunked" encoding internally. Because it is a gateway,
we stated it was not a problem if unknown encodings were applied on a
message because it is the recipient responsibility to accept the message or
not. And indeed, it is not a problem if both the client and the server
connections are using H1. However, Transfer-Encoding headers are dropped
from H2 messages. It is not a problem for chunk-encoded payload because
dechunking is performed during H1 parsing. But, for any other encodings, the
xferred H2 message is invalid.
It is also a problem for internal payload manipulations (lua,
filters...). Because the TE request headers are now sanitiezd, unsupported
encoding should not be used by servers. Thus it is only a problem for the
request messages. For this reason, such messages are now rejected. And if a
server decides to use an unknown encoding, the response will also be
rejected.
Note that it is pretty uncommon to use other encoding than "chunked" on the
request payload. So it is not necessary to backport it.
This patch should fix the issue #1301. No backport is needed.
According to the RFC7230, "chunked" encoding must not be applied more than
once to a message body. To handle this case, h1_parse_xfer_enc_header() is
now responsible to fail when a parsing error is found. It also fails if the
"chunked" encoding is not the last one for a request.
To help the parsing, two H1 parser flags have been added: H1_MF_TE_CHUNKED
and H1_MF_TE_OTHER. These flags are set, respectively, when "chunked"
encoding and any other encoding are found. H1_MF_CHNK flag is used when
"chunked" encoding is the last one.
Only chunk-encoded response payloads are supported by HAProxy. All other
transfer encodings are not supported and will be an issue if the HTTP
compression is enabled. So be sure only "trailers" is send in TE request
headers.
The patch is related to the issue #1301. It must be backported to all stable
versions. Be carefull for 2.0 and lower because the HTTP legacy must also be
fixed.
Transfer-Encoding header is not supported in HTTP/1.0. However, softwares
dealing with HTTP/1.0 and HTTP/1.1 messages may accept it and transfer
it. When a Content-Length header is also provided, it must be
ignored. Unfortunately, this may lead to vulnerabilities (request smuggling
or response splitting) if an intermediary is only implementing
HTTP/1.0. Because it may ignore Transfer-Encoding header and only handle
Content-Length one.
To avoid any security issues, when Transfer-Encoding and Content-Length
headers are found in a message, the close mode is forced. The same is
performed for HTTP/1.0 message with a Transfer-Encoding header only. This
change is conform to what it is described in the last HTTP/1.1 draft. See
also httpwg/http-core#879.
Note that Content-Length header is also removed from any incoming messages
if a Transfer-Encoding header is found. However it is not true (not yet) for
responses generated by HAProxy.
This kind of requests is now forbidden and rejected with a
413-Payload-Too-Large error.
It is unexpected to have a payload for GET/HEAD/DELETE requests. It is
explicitly allowed in HTTP/1.1 even if some servers may reject such
requests. However, HTTP/1.0 is not clear on this point and some old servers
don't expect any payload and never look for body length (via Content-Length
or Transfer-Encoding headers).
It means that some intermediaries may properly handle the payload for
HTTP/1.0 GET/HEAD/DELETE requests, while some others may totally ignore
it. That may lead to security issues because a request smuggling attack is
possible.
To prevent any issue, those requests are now rejected.
See also httpwg/http-core#904
When a parsing error is triggered, the status code may be customized by
setting H1C .errcode field. By default a 400-Bad-Request is returned. The
function h1_handle_bad_req() has been renamed to h1_handle_parsing_error()
to be more generic.
In h1_ctl(), if output parameter is provided when MUX_EXIT_STATUS is
returned, it is used to set the error code. In addition, any client errors
(4xx), except for 408 ones, are handled as invalid errors
(MUX_ES_INVALID_ERR). This way, it will be possible to customize the parsing
error code for request messages.
The mux .ctl callback can provide some information about the mux to the
caller if the third parameter is provided. Thus, when MUX_EXIT_STATUS is
retrieved, a pointer on the status is now passed. The mux may fill it. It
will be pretty handy to provide custom error code from h1 mux instead of
default ones (400/408/500/501).
The startup code was still ugly with tons of unreadable nested ifdefs.
Let's just have one function to set up the extra threads and another one
to wait for their completion. The ifdefs are isolated into their own
functions now and are more readable, just like the end of main(), which
now uses the same statements to start thread 0 with and without threads.
Till now the threads startup was quite messy:
- we would start all threads but one
- then we would change all threads' CPU affinities
- then we would manually start the poll loop for the current thread
Let's change this by moving the CPU affinity setting code to a function
set_thread_cpu_affinity() that does this job for the current thread only,
and that is called during the thread's initialization in the polling loop.
It takes care of not doing this for the master, and will result in all
threads to be properly bound earlier and with cleaner code. It also
removes some ugly nested ifdefs.
Probably because of some copy-paste from "nbproc", "nbthread" used to
be parsed in cfgparse instead of using a registered parser. Let's fix
this to clean up the code base now.
ASAN reported a buffer overflow in the httpclient. This overflow is the
consequence of ist0() which is incorrect here.
Replace all occurences of ist0() by istptr() which is more appropried
here since all ist in the httpclient were created from strings.
src/hlua.c:7074:6: error: variable 'url_str' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
if (lua_type(L, -1) == LUA_TSTRING)
^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
src/hlua.c:7079:36: note: uninitialized use occurs here
hlua_hc->hc->req.url = istdup(ist(url_str));
^~~~~~~
Return an error on the stack if the argument is not a string.
Provide a new field "headers" in the response of the HTTPClient, which
contains all headers of the response.
This field is a multi-dimensionnal table which could be represented this
way in lua:
headers = {
["content-type"] = { "text/html" },
["cache-control"] = { "no-cache" }
}
This commit provides an hlua_httpclient object which is a bridge between
the httpclient and the lua API.
The HTTPClient is callable in lua this way:
local httpclient = core.httpclient()
local response = httpclient:get("http://127.0.0.1:9000/?s=9999")
core.Debug("Status: ".. res.status .. ", Reason : " .. res.reason .. ", Len:" .. string.len(res.body) .. "\n")
The resulting response object will provide a "status" field which
contains the status code, a "reason" string which contains the reason
string, and a "body" field which contains the response body.
The implementation uses the httpclient callback to wake up the lua task
which yield each time it pushes some data. The httpclient works in the
same thread as the lua task.
According with the W3 CSS specification, media queries 5 allow
the browser to enable some CSS when dark mode is enabled. This
patch defines dark mode CSS for the stats page.
https://www.w3.org/TR/mediaqueries-5/#prefers-color-scheme
A bug was introduced in the commit cff0f739e5 ("MINOR: counters: Review
conditions to increment counters from analysers"). The internal_errors
counter for the target server was incremented twice. The counter for the
session listener needs to be incremented instead.
This must be backported everywhere the commit cff0f739e5 is.
The transient flag CO_RFL_BUF_NOT_STUCK should now be set when the mux's
rcv_buf() function is called, in si_cs_recv(), to be sure the mux is able to
perform some optimisation during data copy. This flag is set when we are
sure the channel buffer is not stuck. Concretely, it happens when there are
data scheduled to be sent.
It is not a fix and this flag is not used for now. But it makes sense to have
this info to be sure to be able to do some optimisations if necessary.
This patch is related to the issue #1362. It may be backported to 2.4 to
ease future backports.
The stream interface is now responsible for defragmenting the HTX message of
the input channel if necessary, before calling the mux's .rcv_buf()
function. The defrag is performed if the underlying buffer contains only
input data while the HTX message free space is not contiguous.
The defrag is important here to be sure the mux and the app layer have the
same criteria to decide if a buffer is full or not. Otherwise, the app layer
may wait for more data because the buffer is not full while the mux is
blocked because it needs more space to proceed.
This patch depends on following commits:
* MINOR: htx: Add an HTX flag to know when a message is fragmented
* MINOR: htx: Add a function to know if the free space wraps
This patch is related to the issue #1362. It may be backported as far as 2.0
after some observation period (not sure it is required or not).
HTX_FL_FRAGMENTED flag is now set on an HTX message when it is
fragmented. It happens when an HTX block is removed in the middle of the
message and flagged as unused. HTX_FL_FRAGMENTED flag is removed when all
data are removed from the message or when the message is defragmented.
Note that some optimisations are still possible because the flag can be
avoided in other situations. For instance when the last header of a bodyless
message is removed.
In si_cs_recv(), some CO_RFL flags are set when the mux's .rcv_buf()
function is called. Some are persitent inside si_cs_recv() scope, some
others must be computed at each call to rcv_buf(). This patch takes care of
distinguishing them.
Among others, CO_RFL_KEEP_RECV is a persistent flag while CO_RFL_BUF_WET is
transient.
If the stream-interface is waiting for more buffer room to store incoming
data, it is important at the stream level to stop to wait for more data to
continue. Thanks to the previous patch ("BUG/MEDIUM: stream-int: Notify
stream that the mux wants more room to xfer data"), the stream is woken up
when this happens. In this patch, we take care to interrupt the
corresponding tcp-content ruleset or to stop waiting for the HTTP message
payload.
To ease detection of the state, si_rx_blocked_room() helper function has
been added. It returns non-zero if the stream interface's Rx path is blocked
because of lack of room in the input buffer.
This patch is part of a series related to the issue #1362. It should be
backported as ar as 2.0, probably with some adaptations. So be careful
during backports.
When the mux failed to transfer data to the upper layer because of a lack of
room, it is important to wake the stream up to let it handle this
event. Otherwise, if the stream is waiting for more data, both the stream
and the mux reamin blocked waiting for each other.
When this happens, the mux set the CS_FL_WANT_ROOM flag on the
conn-stream. Thus, in si_cs_recv() we are able to detect this event. Today,
the stream-interface is blocked. But, it is not enough to wake the stream
up. To fix the bug, CF_READ_PARTIAL flag is extended to also handle cases
where a read exception occurred. This flag should idealy be renamed. But for
now, it is good enough. By setting this flag, we are sure the stream will be
woken up.
This patch is part of a series related to the issue #1362. It should be
backported as far as 2.0, probably with some adaptations. So be careful
during backports.
When a message is parsed and copied into the channel buffer, in
h1_process_demux(), more space is requested if some pending data remain
after the parsing while the channel buffer is not empty. To do so,
CS_FL_WANT_ROOM flag is set. It means the H1 parser needs more space in the
channel buffer to continue. In the stream-interface, when this flag is set,
the SI is considered as blocked on the RX path. It is only unblocked when
some data are sent.
However, it is not accurrate because the parsing may be stopped because
there is not enough data to continue. For instance in the middle of a chunk
size. In this case, some data may have been already copied but the parser is
blocked because it must receive more data to continue. If the calling SI is
blocked on RX at this stage when the stream is waiting for the payload
(because http-buffer-request is set for instance), the stream remains stuck
infinitely.
To fix the bug, we must request more space to the app layer only when it is
not possible to copied more data. Actually, this happens when data remain in
the input buffer while the H1 parser is in states MSG_DATA or MSG_TUNNEL, or
when we are unable to copy headers or trailers into a non-empty buffer.
The first condition is quite easy to handle. The second one requires an API
refactoring. h1_parse_msg_hdrs() and h1_parse_msg_tlrs() fnuctions have been
updated. Now it is possible to know when we need more space in the buffer to
copy headers or trailers (-2 is returned). In the H1 mux, a new H1S flag
(H1S_F_RX_CONGESTED) is used to track this state inside h1_process_demux().
This patch is part of a series related to the issue #1362. It should be
backported as far as 2.0, probably with some adaptations. So be careful
during backports.
In h1_postparse_req_hdrs(), if we need more space to copy headers, the request
parser is reset. However, because of a typo, it was reset as a response parser
instead of a request one. h1m_init_req() must be called.
This patch must be backported as far as 2.2.
We wake up the xprt as soon as STREAM frames have been pushed to
the TX mux buffer (->tx.buf).
We also make the mux subscribe() to the xprt layer if some data
remain in its ring buffer after having try to transfer them to the
xprt layer (TX mux buffer for the stream full).
Also do not consider a buffer in the ring if not allocated (see b_size(buf))
condition in the for(;;) loop.
Make a call to qc_process_mux() if possible when entering qc_send() to
fill the mux with data from streams in the send or flow control lists.
The FIN of a STREAM frame to be built must be set if there is no more
at all data in the ring buffer.
Do not do anything if there is nothing to transfer the ->tx.buf mux
buffer via b_force_xfer() (without zero copy)
When ACK have been received by the xprt, it must wake up the
mux if this latter has subscribed to SEND events. This is the
role of qcs_try_to_consume() to detect such a situation. This
is the function which consumes the buffer filled by the mux.
It is important to know if the packet number spaces used during the
handshakes have really been discarding. If not, this may have a
significant impact on the packet loss detection.
There were cases where the Initial packet number space was not discarded.
This leaded the packet loss detection to continue to take it into
considuration during the connection lifetime. Some Application level
packets could not be retransmitted.
QUIC_FL_TX_PACKET_ACK_ELICITING was replaced by QUIC_FL_RX_PACKET_ACK_ELICITING
by this commit due to a copy and paste:
e5b47b637 ("MINOR: quic: Add a mask for TX frame builders and their authorized packet types")
Furthermore the flags for the PADDING frame builder was not initialized.
The STREAM data to send coming from the upper layer must be stored until
having being acked by the peer. To do so, we store them in buffer structs,
one by stream (see qcs.tx.buf). Each time a STREAM is built by quic_push_frame(),
its offset must match the offset of the first byte added to the buffer (modulo
the size of the buffer) by the frame. As they are not always acknowledged in
order, they may be stored in eb_trees ordered by their offset to be sure
to sequentially delete the STREAM data from their buffer, in the order they
have been added to it.
The peer transport parameter values were not initialized with
the default ones (when absent), especially the
"active_connection_id_limit" parameter with 2 as default value
when absent from received remote transport parameters. This
had as side effect to send too much NEW_CONNECTION_ID frames.
This was the case for curl which does not announce any
"active_connection_id_limit" parameter.
Also rename ->idle_timeout to ->max_idle_timeout to reflect the RFC9000.
These salts are used to derive initial secrets to decrypt the first Initial packet.
We support draft-29 and v1 QUIC version initial salts.
Add parameters to our QUIC-TLS API functions used to derive these secret for
these salts.
Make our xprt_quic use the correct initial salt upon QUIC version field found in
the first paquet. Useful to support connections with curl which use draft-29
QUIC version.
Move the "ACK required" bit from the packet number space to the connection level.
Force the "ACK required" option when acknowlegding Handshake or Initial packet.
A client may send three packets with a different encryption level for each. So,
this patch modifies qc_treat_rx_pkts() to consider two encryption level passed
as parameters, in place of only one.
Make qc_conn_io_cb() restart its process after the handshake has succeeded
so that to process any Application level packets which have already been received
in the same datagram as the last CRYPTO frames in Handshake packets.
We must take as most as possible data from STREAM frames to be encapsulated
in QUIC packets, almost as this is done for CRYPTO frames whose fields are
variable length fields. The difference is that STREAM frames are only accepted
for short packets without any "Length" field. So it is sufficient to call
max_available_room() for that in place of max_stream_data_size() as this
is done for CRYPTO data.
It is possible the TLS stack stack provides us with 1-RTT TX secrets
at the same time as Handshake secrets are provided. Thanks to this
simple patch we can build Application level packets during the handshake.
Make qc_prep_hdshk_pkts() and qui_conn_io_cb() handle the case
where we enter them with QUIC_HS_ST_COMPLETE or QUIC_HS_ST_CONFIRMED
as connection state with QUIC_TLS_ENC_LEVEL_APP and QUIC_TLS_ENC_LEVEL_NONE
to consider to prepare packets.
quic_get_tls_enc_levels() is modified to return QUIC_TLS_ENC_LEVEL_APP
and QUIC_TLS_ENC_LEVEL_NONE as levels to consider when coalescing
packets in the same datagram.
With very few packets received by the listener, it is possible
that its state may move from QUIC_HS_ST_SERVER_INITIAL to
QUIC_HS_ST_COMPLETE without transition to QUIC_HS_ST_SERVER_HANDSHAKE state.
This latter state is not mandatory.
This simple enable use to coalesce Application level packet with
Handshake ones at the end of the handshake. This is highly useful
if we do want to send a short Handshake packet followed by Application
level ones.
We must evaluate the packet lenghts in advance to be sure we do not
consume a packet number for nothing. The packet building must always
succeeds. This is the role of qc_eval_pkt() implemented by this patch
called before calling qc_do_build_pkt() which was previously modified to
always succeed.
There were cases where the encoded size of acks was not updated leading
to ACK frames building too big compared to the expected size. At this
time, this makes the code "BUG_ON()".
Rename qc_build_hdshk_pkt() to qc_build_pkt() and qc_do_build_hdshk_pkt()
to qc_do_build_pkt().
Update their comments consequently.
Make qc_do_build_hdshk_pkt() BUG_ON() when it does not manage to build
a packet. This is a bug!
Remove the functions which were specific to the Application level.
This is the same function which build any packet for any encryption
level: quic_prep_hdshk_pkts() directly called from the quic_conn_io_cb().
There is no need to pass a copy of CRYPTO frames to qc_build_frm() from
qc_do_build_hdshk_pkt(). Furthermore, after the previous modifications,
qc_do_build_hdshk_pkt() do not build only CRYPTO frame from ->pktns.tx.frms
MT_LIST but any type of frame.
Atomically increase the "next packet variable" before building a new packet.
Make the code bug on a packet building failure. This should never happen
if we do not want to consume a packet number for nothing. There are remaining
modifications to come to ensure this is the case.
Modify this task which is called at least each a packet is received by a listener
so that to make it behave almost as qc_do_hdshk(). This latter is no more useful
and removed.
This function was responsible of building CRYPTO frames to fill as much as
possible a packet passed as argument. This patch makes it support any frame
except STREAM frames whose lengths are highly variable.
We want to treat all the frames to be built the same way as frames
built during handshake (CRYPTO frames). So, let't store them at the same
place which is an MT_LIST.
As this has been done for RX frame parsers, we add a mask for each TX frame
builder to denote the packet types which are authorized to embed such frames.
Each time a TX frame builder is called, we check that its mask matches the
packet type the frame is built for.
These structures are similar. quic_tx_frm was there to try to reduce the
size of such objects which embed a union for all the QUIC frames.
Furtheremore this patch fixes the issue where quic_tx_frm objects were freed
from the pool for quic_frame.
Make quic_rx_packet_ref(inc|dec)() functions be thread safe.
Make use of ->rx.crypto.frms_rwlock RW lock when manipulating RX frames
from qc_treat_rx_crypto_frms().
Modify atomically several variables attached to RX part of quic_enc_level struct.
->rx.crypto member of quic_enc_level struct was not initialized as
this was done for all other members of this structure. This patch
fixes this.
Also adds a RW lock for the frame of this member.
If we let the connection packet handler task (quic_conn_io_cb) process the first
client Initial packet which contain the TLS Client Hello message before the mux
context is initialized, quic_mux_transport_params_update() makes haproxy crash.
->start xprt callback already wakes up this task and is called after all the
connection contexts are initialized. So, this patch do not wakes up quic_conn_io_cb()
if the mux context is not initialized (this was already the case for the connection
context (conn_ctx)).
If we add TX packets to their trees before sending them, they may
be detected as lost before being sent. This may make haproxy crash
when it retreives the prepared packets from TX ring buffers, dereferencing
them after they have been freed.
We use only ring buffers (struct qring) to prepare and send QUIC datagrams.
We can safely remove the old buffering implementation which was not thread safe.
We modify the functions responsible of building packets to put these latters
in ring buffers (qc_build_hdshk_pkt() during the handshake step, and
qc_build_phdshk_apkt() during the post-handshake step). These functions
remove a ring buffer from its list to build as much as possible datagrams.
Eache datagram is prepended of two field: the datagram length and the
first packet in the datagram. We chain the packets belonging to the same datagram
in a singly linked list to reach them from the first one: indeed we must
modify some members of each packet when we really send them from send_ppkts().
This function is also modified to retrieved the datagram from ring buffers.
We initialize the pointer to the listener TX ring buffer list.
Note that this is not done for QUIC clients as we do not fully support them:
we only have to allocate the list and attach it to server struct I guess.
We allocate an array of QUIC ring buffer, one by thread, and arranges them in a
MT_LIST. Everything is allocated or nothing: we do not want to usse an incomplete
array of ring buffers to ensure that each thread may safely acquire one of these
buffers.
Before this patch we reserved 16 bytes (QUIC_TLS_TAG_LEN) before building the
handshake packet to be sure to be able to add the tag which comes with the
the packet encryption, decreasing the end offset of the building buffer by 16 bytes.
But this tag length was taken into an account when calling qc_build_frms() which
computes and build crypto frames for the remaining available room thanks to <*len>
parameter which is the length of the already present bytes in the building buffer
before adding CRYPTO frames. This leaded us to waste the 16 last bytes of the buffer
which were not used.
This make at least our listeners answer to ngtcp2 clients without
HelloRetryRequest message. It seems the server choses the first
group in the group list ordered by preference and set by
SSL_CTX_set1_curves_list() which match the client ones.
This implementation is inspired from Linux kernel circular buffer implementation
(see include/linux/circ-buf.h). Such buffers may be used at the same time both
by writer and reader (lock-free).
Modify the I/O dgram handler principal function used to parse QUIC packets
be thread safe. Its role is at least to create new incoming connections
add to two trees protected by the same RW lock. The packets are for now on
fully parsed before possibly creating new connections.
Allocate everything needed for a connection (struct quic_conn) from the same
function.
Rename qc_new_conn_init() to qc_new_conn() to reflect these modifications.
Insert these connection objects in their tree after returning from this function.
Some SSL call may be called with pointer to ssl_sock_ctx struct as parameter
which does not match the quic_conn_ctx struct type (see ssl_sock_infocb()).
I am not sure we have to keep such callbacks for QUIC but we must ensure
the SSL and QUIC xprts use the same data structure as context.
Move the connection state from quic_conn_ctx struct to quic_conn struct which
is the structure which is used to store the QUIC connection part information.
This structure is initialized by the I/O dgram handler for each new connection
to QUIC listeners. This is needed for the multithread support so that to not
to have to depend on the connection context potentially initialized by another
thread.
We must protect from concurrent the tree which stores the QUIC packets received
by the dgram I/O handler, these packets being also parsed by the xprt task.
No need to call free_quic_rx_packet() after calling quic_rx_packet_eb64_delete()
as this latter already calls quic_rx_packet_refdec() also called by
free_quic_rx_packet().
Let's say that we have to insert a range R between to others A and B
with A->first <= R->first <= B->first. We have to remove the ranges
which are overlapsed by R during. This was correctly done when
the intersection between A and R was not empty, but not when the
intersection between R and B was not empty. If this latter case
after having inserting a new range R we set <new> variable as the
node to consider to check the overlaping between R and its following
ranges.
Make depends qc_new_isecs() only on quic_conn struct initialization only (no more
dependency on connection struct initialization) to be able to run it as soon as
the quic_conn struct is initialized (from the I/O handler) before running ->accept()
quic proto callback.
We remove the header protection of packet only for connection with already
initialized context. This latter keep traces of the connection state.
Furthermore, we enqueue the first Initial packet for a new connection
after having completely parsed the packet so that to not start the accept
process for nothing.
Move the QUIC conn (struct quic_conn) initialization from quic_sock_accept_conn()
to qc_lstnr_pkt_rcv() as this is done for the server part.
Move the timer initialization to ->start xprt callback to ensure the connection
context is done : it is initialized by the ->accept callback which may be run
by another thread than the one for the I/O handler which also run ->start.
Move the call to SSL_set_quic_transport_params() from the listener I/O dgram
handler to the ->init() callback of the xprt (qc_conn_init()) which initializes
its context where is stored the SSL context itself, needed by
SSL_set_quic_transport_params(). Furthermore this is already what is done for the
server counterpart of ->init() QUIC xprt callback. As the ->init() may be run
by another thread than the one for the I/O handler, the xprt context could
not be potentially already initialized before calling SSL_set_quic_transport_params()
from the I/O handler.
The name the maximum packet size transport parameter was ambiguous and replaced
by maximum UDP payload size. Our code would be also ambiguous if it does not
reflect this change.
Set the streams transport parameters which could not be initialized because they
were not available during initializations. Indeed, the streams transport parameters
are provided by the peer during the handshake.
Really signal the caller that ->accept() has failed if the session could not
be initialized because conn_complete_session() has failed. This is the case
if the mux could not be initialized too.
When it fails an ->accept() must returns -1 in case of resource shortage.
Deactivate the action of this callback at this time. I am not sure
we will keep it for QUIC as it does not really make sense for QUIC:
the QUIC packet are already recvfrom()'ed by the low level I/O handler
used for all the connections.
This file has been derived from mux_h2.c removing all h2 parts. At
QUIC mux layer, there must not be any reference to http. This will be the
responsability of the application layer (h3) to open streams handled by the mux.
We move ->params transport parameters to ->rx.params. They are the
transport parameters which will be sent to the peer, and used for
the endpoint flow control. So, they will be used to received packets
from the peer (RX part).
Also move ->rx_tps transport parameters to ->tx.params. They are the
transport parameter which are sent by the peer, and used to respect
its flow control limits. So, they will be used when sending packets
to the peer (TX part).
This bug may occur when displaying streams traces. It came with this commit:
242fb1b63 ("MINOR: quic: Drop packets with STREAM frames with wrong direction.").
An optimization was brought in commit 5064ab6a9 ("OPTIM: lb-leastconn:
do not unlink the server if it did not change") to avoid locking the
server just to discover it did not move. However a mistake was made
because the operation involves a divide with a value that is read
outside of its usual lock, which makes it possible to be zero at the
exact moment we watch it if another thread takes the server down under
the lbprm lock, resulting in a divide by zero.
Therefore we must check that the value is not null there.
This must be backported to 2.4.
The "process" directive on "bind" lines becomes quite confusing considering
that the only allowed value is 1 for the process, and that threads are
optional and come after the mandatory "1/".
Let's introduce a new "thread" directive to directly configure thread
numbers, and mark "process" as deprecated. Now "process" will emit a
warning and will suggest how to be replaced with "thread" instead.
The doc was updated accordingly (mostly a copy-paste of the previous
description which was already up to date).
This is marked as MEDIUM as it will impact users having "zero-warning"
and "process" specified.
Enable the 'slowstart' keyword for dynamic servers. The slowstart task
is allocated in 'add server' handler if slowstart is used.
As the server is created in disabled state, there is no need to start
the task. The slowstart task will be automatically started on the first
'enable server' invocation.
'slowstart' can be used without check on a server, with the CLI handlers
'enable/disable server'. Move the code to initialize and start the
slowstart task outside of check.c.
This change will also be reused to enable slowstart for dynamic servers.
Allow to use the check related keywords defined in server.c. These
keywords can be enabled now that checks have been implemented for
dynamic servers.
Here is the list of the new keywords supported :
- error-limit
- observe
- on-error
- on-marked-down
- on-marked-up
Allow to configure ssl support for dynamic server checks independently
of the ssl server configuration. This is done via the keyword
"check-ssl". Also enable to configure the sni/alpn used for the check
via "check-sni/alpn".
The ssl context is not initialized for a dynamic server, even if there
is a tcpcheck rule which uses ssl on the related backed. This will cause
the check initialization to failed with the message :
"Out of memory when initializing an SSL connection"
This can be reproduced by having the following config in the backend :
option tcp-check
tcp-check connect ssl
and create a dynamic server with check activated and a ca-file.
Fix this by calling the prepare_srv xprt callback when the proxy options
PR_O_TCPCKH_SSL is set.
Check support for dynamic servers has been merged in the current branch.
No backport needed.
Test that checks have been configured on the server before enabling via
the 'enable health' CLI. This mirrors the 'enable agent' command.
Without this, a user can use the command on the server without checks.
This leaves the server in an undefined state. Notably, the stat page
reports the server in check transition.
This condition was left on the following reorg commit.
2c04eda8b5
REORG: cli: move "{enable|disable} health" to server.c
This should be backported up to 1.8.
The issue is introduced with the commit c41d8bd65 ("CLEANUP: flt-trace:
Remove unused random-parsing option").
This must be backported everywhere the above commit is.
appctx_new() is exclusively called with tid_bit and it only uses the
mask to pass it to the accompanying task. There is no point requiring
the caller to know about a mask there, nor is there any point in
creating an applet outside of the context of its own thread anyway.
Let's drop this and pass tid_bit to task_new() directly.
Ilya reports in GH #1392 that clang 13 complains about totlen being
calculated and not used in fd_write_frag_line(), which is true. It's
a leftover of some older code.
Ilya reports in GH #1392 that clang 13 complains about a flag being added
to the "flags" parameter without being used later. That's generic code
that was shared from TCP but we can indeed drop this flag since it's used
for TFO which we don't have in socketpairs.
The CLI's payload parser is over-complicated and as such contains more
bugs than needed. One of them is that it uses strstr() to find the
ending tag, ignoring spaces before it, while the argument locator
creates a new arg on each space, without checking if the end of the
word appears past the previously found end. This results in "<<" being
considered as the start of a new argument if preceeded by more than
one space, and the payload being damaged with a \0 inserted at the
first space or tab.
Let's make an easily backportable fix for now. This fix makes sure that
the trailing zero from the first line is properly kept after '<<' and
that the end tag is looked for only as an isolated argument and nothing
else. This also gets rid of the unsuitable strstr() call and now makes
sure that strcspn() will not return elements that are found in the
payload.
For the long term the loop must be rewritten to get rid of those
unsuitable strcspn() and strstr() calls which work past each other, and
the cli_parse_request() function should be split into a tokenizer and
an executor that are used from the caller instead of letting the caller
play games with what it finds there.
This should be backported wherever CLI payload is supported, i.e. 2.0+.
Move the code to allocate/free the mux cleanup task outside of the polling
loop. A new thread_alloc/free handler is registered for this in
connection.c.
This has the benefit to clean up the polling loop code. And as another
benefit, if the task allocation fails, the handler can report an error
to exit the haproxy process. This prevents a potential null pointer
dereferencing.
This should fix the github issue #1389.
This must be backported up to 2.4.
When the LDAP response is parsed, the message length is not properly
decoded. While it works for LDAP servers encoding it on 1 byte, it does not
work for those using a multi-bytes encoding. Among others, Active Directory
servers seems to encode messages or elements length on 4 bytes.
In this patch, we only handle length of BindResponse messages encoded on 1,
2 or 4 bytes. In theory, it may be encoded on any bytes number less than 127
bytes. But it is useless to make this part too complex. It should be ok this
way.
This patch should fix the issue #1390. It should be backported to all stable
versions. While it should be easy to backport it as far as 2.2, the patch
will have to be totally rewritten for lower versions.
Ilya reported in issue #1391 a build warning on Fedora about mallinfo()
being deprecated in favor of mallinfo2() since glibc-2.33. Let's add
support for it. This should be backported where the following commit is
also backported: 157e39303 ("MINOR: pools: automatically disable
malloc_trim() with external allocators").
If an error was already reported on the H1 connection, pending input data
must not be (re)evaluated in h1_process(). Otherwise an unexpected internal
error will be reported, in addition of the first one. And on some
conditions, this may generate an infinite loop because the mux tries to send
an internal error but it fails to do so thus it loops to retry.
This patch should fix the issue #1356. It must be backported to 2.4.
The "unresolved" variable is unused since commit 9fa0df5 ("BUG/MINOR: acl:
Fix freeing of expr->smp in prune_acl_expr").
This patch should fix the issue #1359.
Pierre Cheynier reported some occasional crashes in malloc_trim() on a
recent glibc when running with jemalloc(). While in theory there should
not be any link between the two, it remains plausible that something
allocated early with one is tentatively freed with the other and that
attempts to trim end up badly. There's no point calling the glibc specific
malloc_trim() with external allocators anyway. However these ones are often
enabled at link time or even at run time with LD_PRELOAD, so we cannot rely
on build options for this.
This patch implements runtime detection for the allocator in use by checking
with mallinfo() that a malloc() call is properly accounted for in glibc's
malloc. It only enables malloc_trim() in this case, and ignores it for
other cases. It's fine to proceed like this because mallinfo() is provided
by a wider range of glibcs than malloc_trim().
This could be backported to 2.4 and 2.3. If so, it will also need previous
patch "CLEANUP: pools: factor all malloc_trim() calls into trim_all_pools()".
The sizeof() was printed as a long but it's just an unsigned on some
32-bit platforms, hence the format warning. No backport is needed, as
this arrived in 2.5 with commit 40ca09c7b ("MINOR: sample: Add be2dec
converter").
A bug was introduced by the commit 26eb5ea35 ("BUG/MINOR: filters: Always
set FLT_END analyser when CF_FLT_ANALYZE flag is set"). Depending on the
channel evaluated, the rigth FLT_END analyser must be set. AN_REQ_FLT_END
for the request channel and AN_RES_FLT_END for the response one.
Ths patch must be backported everywhere the above commit was backported.
When an error is returned to the client, via a call to
http_reply_and_close(), the request channel is flushed and shut down and
HTTP analysis on both direction is finished. So it is safer to centralize
reset of channels analysers at this place. It is especially important when a
filter is attached to the stream when a client abort is detected. Because,
otherwise, the stream remains blocked because request analysers are not
reset.
This bug was hidden for a while. But since the fix 6fcd2d328 ("BUG/MINOR:
stream: Don't release a stream if FLT_END is still registered"), it is
possible to trigger it.
This patch must be backported everywhere the above commit was backported.
If the end of input is reported by the mux on the conn-stream during a
receive, we leave without evaluating the channel policies. It is especially
important to be able to catch client aborts during server connection
establishment. Indeed, in this case, without this patch, the
stream-interface remains blocked and read events are not forwarded to the
stream. It means it is not possible to detect client aborts.
Thanks to this fix, the abortonclose option should fixed for HAProxy 2.3 and
lower. On 2.4 and 2.5, it seems to work because the stream is created after
the request parsing.
Note that a previous fix of abortonclose option was reverted. This one
should be the right way to fix it. It must carefully be backported as far as
2.0. A observation period on the 2.3 is probably a good idea.
Now, "Upgrade:" header is removed from such requests. Thus, the condition to
reject them is now useless and can be removed. Code to handle unimplemented
features is now unused but is preserved for future uses.
This patch may be backported to 2.4.
Instead of returning a 501-Not-implemented error when "Ugrade:" header is
found for a request with a payload, the header is removed. This way, the
upgrade is disabled and the request is still sent to the server. It is
required because some frameworks seem to try to perform H2 upgrade on every
requests, including POST ones.
The h2 mux was slightly fixed to convert Upgrade requests to extended
connect ones only if the rigth HTX flag is set.
This patch should fix the issue #1381. It must be backported to 2.4.
The sole purpose of the variable's usage accounting is to enforce
limits at the session or process level, but very commonly these are not
set, yet the bookkeeping (especially at the process level) is extremely
expensive.
Let's simply disable it when the limits are not set. This further
increases the performance of 12 variables on 16-thread from 1.06M
to 1.24M req/s.
Right now we have a per-process max variable size and a per-scope one,
with the proc scope covering all others. As such, the per-process global
one is always exactly equal to the per-proc-scope one. And bookkeeping
on these process-wide variables is extremely expensive (up to 38% CPU
seen in var_accounting_diff() just for them).
Let's kill vars_global_size and only rely on the proc one. Doing this
increased the request rate from 770k to 1.06M in a config having only
12 variables on a 16-thread machine.
The global table of known variables names can only grow and was designed
for static names that are registered at boot. Nowadays it's possible to
set dynamic variable names from Lua or from the CLI, which causes a real
problem that was partially addressed in 2.2 with commit 4e172c93f
("MEDIUM: lua: Add `ifexist` parameter to `set_var`"). Please see github
issue #624 for more context.
This patch simplifies all this by removing the need for a central
registry of known names, and storing 64-bit hashes instead. This is
highly sufficient given the low number of variables in each context.
The hash is calculated using XXH64() which is bijective over the 64-bit
space thus is guaranteed collision-free for 1..8 chars. Above that the
risk remains around 1/2^64 per extra 8 chars so in practice this is
highly sufficient for our usage. A random seed is used at boot to seed
the hash so that it's not attackable from Lua for example.
There's one particular nit though. The "ifexist" hack mentioned above
is now limited to variables of scope "proc" only, and will only match
variables that were already created or declared, but will now verify
the scope as well. This may affect some bogus Lua scripts and SPOE
agents which used to accidentally work because a similarly named
variable used to exist in a different scope. These ones may need to be
fixed to comply with the doc.
Now we can sum up the situation as this one:
- ephemeral variables (scopes sess, txn, req, res) will always be
usable, regardless of any prior declaration. This effectively
addresses the most problematic change from the commit above that
in order to work well could have required some script auditing ;
- process-wide variables (scope proc) that are mentioned in the
configuration, referenced in a "register-var-names" SPOE directive,
or created via "set-var" in the global section or the CLI, are
permanent and will always accept to be set, with or without the
"ifexist" restriction (SPOE uses this internally as well).
- process-wide variables (scope proc) that are only created via a
set-var() tcp/http action, via Lua's set_var() calls, or via an
SPOE with the "force-set-var" directive), will not be permanent
but will always accept to be replaced once they are created, even
if "ifexist" is present
- process-wide variables (scope proc) that do not exist will only
support being created via the set-var() tcp/http action, Lua's
set_var() calls without "ifexist", or an SPOE declared with
"force-set-var".
This means that non-proc variables do not care about "ifexist" nor
prior declaration, and that using "ifexist" should most often be
reliable in Lua and that SPOE should most often work without any
prior declaration. It may be doable to turn "ifexist" to 1 by default
in Lua to further ease the transition. Note: regtests were adjusted.
Cc: Tim Dsterhus <tim@bastelstu.be>
Variables names will be hashed, but for this we need a random seed.
The XXH3() algorithms is bijective over the whole 64-bit space, which
is great as it guarantees no collision for 1..8 byte names. But above
that even if the risk is extremely faint, it theoretically exists and
since variables may be set from Lua we'd rather do our best to limit
the risk of controlled collision, hence the random seed.
All variables whose names are parsed by the config parser, the
command-line parser or the SPOE's register-var-names parser are
now preset as permanent. This will guarantee that these variables
will exist through out all the process' life, and that it will be
possible to implement the "ifexist" feature by looking them up.
This was marked medium because pre-setting a variable with an empty
value may always have side effects, even though none was spotted at
this stage.
We certainly do not want that a permanent variable (one that is listed
in the configuration) be erased by accident by an "unset-var" action.
Let's make sure these ones are only reset to an empty sample, like at
the moment of their initial registration. One trick is that the same
function is used to purge the memory at the end and to delete, so we
need to add an extra "force" argument to make the choice.
In order to continue to honor the ifexist Lua option and prevent rogue
SPOA agents from creating too many variables, we'll need to keep the
ability to mark certain proc.* variables as permanent when they're
known from the config file.
Let's add a flag there for this. It's added to the variable when the
variable is created with this flag set by the caller.
Another approach could have been to use a distinct list or distinct
scope but that sounds complicated and bug-prone.
Storing an unset sample (SMP_T_ANY == 0) will be used to only reserve
the variable's space but associate no value. We need to slightly adjust
var_to_smp() for this so that it considers a value-less variable as non
existent and falls back to the default value.
Passing this flag to var_set() will result in the variable to only be
created if it did not exist, otherwise nothing is done (it's not even
updated). This will be used for pre-registering names.
When setting variables, there are currently two variants, one which will
always create the variable, and another one, "ifexist", which will only
create or update a variable if a similarly named variable in any scope
already existed before.
The goal was to limit the risk of injecting random names in the proc
scope, but it was achieved by making use of the somewhat limited name
indexing model, which explains the scope-agnostic restriction.
With this change, we're moving the check downwards in the chain, at the
variable level, and only variables under the scope "proc" will be subject
to the restriction. A new set of VF_* flags was added to adjust how
variables are set, and VF_UPDATEONLY is used to mention this restriction.
In this exact state of affairs, this is not completely exact, as if a
similar name was not known in any scope, the variable will continue to
be rejected like before, but this will change soon.
The names for these two functions are totally misleading, they have
nothing to do with samples, they're purely dedicated to variables. The
former is only used by the second one and makes no sense by itself, so
it cannot even get a meaningful name. Let's remerge them into a single
one called "var_set()" which, as its name tries to imply, sets a variable
to a given value.
This name was quite misleading, as it has nothing to do with samples nor
streams. This function's sole purpose is to unset a variable, so let's
call it "var_unset()" and document it a little bit.
The vars_init() name is particularly confusing as it does not initialize
the variables code but the head of a list of variables passed in
arguments. And we'll soon need to have proper initialization code, so
let's rename it now.
In ticket #1348 some users expressed some concerns regarding the removal
of the "grace" directive from the proxies. Their use case very closely
mimmicks the original intent of the grace keyword, which is, let haproxy
accept traffic for some time when stopping, while indicating an external
LB that it's stopping.
This is implemented here by starting a task whose expiration triggers
the soft-stop for real. The global "stopping" variable is immediately
set however. For example, this below will be sufficient to instantly
notify an external check on port 9999 that the service is going down,
while other services remain active for 10s:
global
grace 10s
frontend ext-check
bind :9999
monitor-uri /ext-check
monitor fail if { stopping }
This reverts commit e0dec4b7b2.
At first glance, channel_is_empty() was used on purpose in si_update_rx(),
because of the HTX ("b3e0de46c" MEDIUM: stream-int: Rely only on
SI_FL_WAIT_ROOM to stop data receipt). It is not pretty clear for now why
channel_may_recv() sould not be used here but this change introduce a
possible infinite loop with the stats applet. So, it is safer to revert the
patch, waiting for a better understanding of the probelm.
This means the abortonclose option will be broken again on the 2.3 and lower
versions.
This patch should fix the issue #1360. It must be backported as far as 2.0.
Since commit "BUG/MINOR: config: reject configs using HTTP with bufsize
>= 256 MB" we are now sure that it's not possible anymore to have an HTX
block of a size 256 MB or more, even after concatenation thanks to the
tests for len >= htx_free_data_space(). Let's remove these now obsolete
comments.
A BUG_ON() was added in htx_add_blk() to track any such exception if
the conditions would change later, to complete the one that is performed
on the start address that must remain within the buffer.
As seen in commit 5ef965606 ("BUG/MINOR: lua: use strlcpy2() not
strncpy() to copy sample keywords"), configs with large values of
tune.bufsize were not practically usable since Lua was introduced,
regardless of the machine's available memory.
In addition, HTX encoding already limits block sizes to 256 MB, thus
it is not technically possible to use that large a buffer size when
HTTP is in use. This is absurdly high anyway, and for example Lua
initialization would take around one minute on a 4 GHz CPU. Better
prevent such a config from starting than having to deal with bug
reports that make no sense.
The check is only enforced if at least one HTX proxy was found, as
there is no techincal reason to block it for configs that are solely
based on raw TCP, and it could still be imagined that some such might
exist with single connections (e.g. a log forwarder that buffers to
cover for the storage I/O latencies).
This should be backported to all HTX-enabled versions (2.0 and above).
It is quite common to see in configurations constructions like the
following one:
http-request set-var(txn.bodylen) 0
http-request set-var(txn.bodylen) req.hdr(content-length)
...
http-request set-header orig-len %[var(txn.bodylen)]
The set-var() rules are almost always duplicated when manipulating
integers or any other value that is mandatory along operations. This is
a problem because it makes the configurations complicated to maintain
and slower than needed. And it becomes even more complicated when several
conditions may set the same variable because the risk of forgetting to
initialize it or to accidentally reset it is high.
This patch extends the var() sample fetch function to take an optional
argument which contains a default value to be returned if the variable
was not set. This way it becomes much simpler to use the variable, just
set it where needed, and read it with a fall back to the default value:
http-request set-var(txn.bodylen) req.hdr(content-length)
...
http-request set-header orig-len %[var(txn.bodylen,0)]
The default value is always passed as a string, thus it will experience
a cast to the output type. It doesn't seem userful to complicate the
configuration to pass an explicit type at this point.
The vars.vtc regtest was updated accordingly.
In preparation for support default values when fetching variables, we
need to update the internal API to pass an extra argument to functions
vars_get_by_{name,desc} to provide an optional default value. This
patch does this and always passes NULL in this argument. var_to_smp()
was extended to fall back to this value when available.
The two functions vars_get_by_name() and vars_get_by_scope() perform
almost the same operations except that they differ from the way the
name and scope are retrieved. The second part in common is more
complex and involves locking, so better factor this one out into a
new function.
There is no other change than refactoring.
Most often "set var" on the CLI is used to set a string, and using only
expressions is not always convenient, particularly when trying to
concatenate variables sur as host names and paths.
Now the "set var" command supports an optional keyword before the value
to indicate its type. "expr" takes an expression just like before this
patch, and "fmt" a format string, making it work like the "set-var-fmt"
actions.
The VTC was updated to include a test on the format string.
Just like the set-var-fmt action for tcp/http rules, the set-var-fmt
directive in global sections allows to pre-set process-wide variables
using a format string instead of a sample expression. This is often
more convenient when it is required to concatenate multiple fields,
or when emitting just one word.
The log-format strings are usable at plenty of places, but the expressions
using %[] were restricted to request or response context and nothing else.
This prevents from using them from the config context or the CLI, let's
relax this.
We're using a dummy temporary proxy when creating global variables in
the configuration file, it was copied from the CLI's code and was
mistakenly called "CLI", better name it "CFG". It should not appear
anywhere except maybe when debugging cores.
When attempting to set a variable does not start with the "proc" scope on
the CLI, we used to emit "only proc is permitted in the global section"
which obviously is a leftover from the initial code.
This may be backported to 2.4.
When a variable starts with the wrong scope, it is named without stripping
the extra characters that follow it, which usually are closing parenthesis.
Let's make sure we only report what is expected.
This may be backported to 2.4.
In commit 9a621ae76 ("MEDIUM: vars: add a new "set-var-fmt" action")
we introduced the support for format strings in variables with the
ability to release them on exit, except that it's the wrong list that
was being scanned for the rule (http vs vars), resulting in random
crashes during deinit.
This was a recent commit in 2.5-dev, no backport is needed.
The set-var() action is convenient because it preserves the input type
but it's a pain to deal with when trying to concatenate values. The
most recurring example is when it's needed to build a variable composed
of the source address and the source port. Usually it ends up like this:
tcp-request session set-var(sess.port) src_port
tcp-request session set-var(sess.addr) src,concat(":",sess.port)
This is even worse when trying to aggregate multiple fields from stick-table
data for example. Due to this a lot of users instead abuse headers from HTTP
rules:
http-request set-header(x-addr) %[src]:%[src_port]
But this requires some careful cleanups to make sure they won't leak, and
it's significantly more expensive to deal with. And generally speaking it's
not clean. Plus it must be performed for each and every request, which is
expensive for this common case of ip+port that doesn't change for the whole
session.
This patch addresses this limitation by implementing a new "set-var-fmt"
action which performs the same work as "set-var" but takes a format string
in argument instead of an expression. This way it becomes pretty simple to
just write:
tcp-request session set-var-fmt(sess.addr) %[src]:%[src_port]
It is usable in all rulesets that already support the "set-var" action.
It is not yet implemented for the global "set-var" directive (which already
takes a string) and the CLI's "set var" command, which would definitely
benefit from it but currently uses its own parser and engine, thus it
must be reworked.
The doc and regtests were updated.
When the expression called in "set-var" uses argments that require late
resolution, the context must be set. At the moment, any unknown argument
is misleadingly reported as "ACL":
frontend f
bind :8080
mode http
http-request set-var(proc.a) be_conn(foo)
parsing [b1.cfg:4]: unable to find backend 'foo' referenced in arg 1 \
of ACL keyword 'be_conn' in proxy 'f'.
Once the context is properly set, it now says the truth:
parsing [b1.cfg:8]: unable to find backend 'foo' referenced in arg 1 \
of sample fetch keyword 'be_conn' in http-request expression in proxy 'f'.
This may be backported but is not really important. If so, the preceeding
patches "BUG/MINOR: vars: improve accuracy of the rules used to check
expression validity" and "MINOR: sample: add missing ARGC_ entries" must
be backported as well.
For a long time we couldn't have arguments in expressions used in
tcp-request, tcp-response etc rules. But now due to the variables
it's possible, and their context in case of failure to resolve an
argument (e.g. backend name not found) is not properly reported
because there is no arg context values in ARGC_* to report them.
Let's add a number of missing ones for tcp-request {connection,
session,content}, tcp-response content, tcp-check, the config
parser (for "set-var" in the global section) and the CLI parser
(for "set-var" on the CLI).
The set-var() expression naturally checks whether expressions are valid
in the context of the rule, but it fails to differentiate frontends from
backends. As such for tcp-content and http-request rules, it will only
accept frontend-compatible sample-fetches, excluding those declared with
SMP_UES_BKEND (a few such as be_id, be_name). For the response it accepts
the backend-compatible expressions only, though it seems that there are
no sample-fetch function that are valid only in the frontend's content,
so that should not cause any problem.
Note that while allowing valid configs to be used, the fix might also
uncover some incorrect configurations where some expressions currently
return nothing (e.g. something depending on frontend declared in a
backend), and which could be rejected, but there does not seem to be
any such keyword. Thus while it should be backported, better not backport
it too far (2.4 and possibly 2.3 only).
The parser checks first for "set-var" then "unset-var" from the updated
offset instead of testing it only when the other one fails, so it
validates this rule as "unset-var":
http-request set-varunset-var(proc.a)
This should be backported everywhere relevant, though it's mostly harmless
as it's unlikely that some users are purposely writing this in their conf!
Sometimes it is convenient to remap large sets of URIs to new ones (e.g.
after a site migration for example). This can be achieved using
"http-request redirect" combined with maps, but one difficulty there is
that non-matching entries will return an empty response. In order to
avoid this, duplicating the operation as an ACL condition ending in
"-m found" is possible but it becomes complex and error-prone while it's
known that an empty URL is not valid in a location header.
This patch addresses this by improving the redirect rules to be able to
simply ignore the rule and skip to the next one if the result of the
evaluation of the "location" expression is empty. However in order not
to break existing setups, it requires a new "ignore-empty" keyword.
There used to be an ACT_FLAG_FINAL on redirect rules that's used during
the parsing to emit a warning if followed by another rule, so here we
only set it if the option is not there. The http_apply_redirect_rule()
function now returns a 3rd value to mention that it did nothing and
that this was not an error, so that callers can just ignore the rule.
The regular "redirect" rules were not modified however since this does
not apply there.
The map_redirect VTC was completed with such a test and updated to 2.5
and an example was added into the documentation.
The bc_conn_err and bc_conn_err_str sample fetches give the status of
the connection on the backend side. The error codes and error messages
are the same than the ones that can be raised by the fc_conn_err fetch.
This new sample fetch along the ssl_bc_hsk_err_str fetch contain the
last SSL error of the error stack that occurred during the SSL
handshake (from the backend's perspective).
The locking in the dequeuing process was significantly improved by commit
49667c14b ("MEDIUM: queue: take the proxy lock only during the px queue
accesses") in that it tries hard to limit the time during which the
proxy's queue lock is held to the strict minimum. Unfortunately it's not
enough anymore, because we take up the task and manipulate a few pendconn
elements after releasing the proxy's lock (while we're under the server's
lock) but the task will not necessarily hold the server lock since it may
not have successfully found one (e.g. timeout in the backend queue). As
such, stream_free() calling pendconn_free() may release the pendconn
immediately after the proxy's lock is released while the other thread
currently proceeding with the dequeuing tries to wake up the owner's
task and dies in task_wakeup().
One solution consists in releasing le proxy's lock later. But tests have
shown that we'd have to sacrifice a significant share of the performance
gained with the patch above (roughly a 20% loss).
This patch takes another approach. It adds a "del_lock" to each pendconn
struct, that allows to keep it referenced while the proxy's lock is being
released. It's mostly a serialization lock like a refcount, just to maintain
the pendconn alive till the task_wakeup() call is complete. This way we can
continue to release the proxy's lock early while keeping this one. It had
to be added to the few points where we're about to free a pendconn, namely
in pendconn_dequeue() and pendconn_unlink(). This way we continue to
release the proxy's lock very early and there is no performance degradation.
This lock may only be held under the queue's lock to prevent lock
inversion.
No backport is needed since the patch above was merged in 2.5-dev only.
This option can be used to define a specific log format that will be
used in case of error, timeout, connection failure on a frontend... It
will be used for any log line concerned by the log-separate-errors
option. It will also replace the format of specific error messages
decribed in section 8.2.6.
If no "error-log-format" is defined, the legacy error messages are still
emitted and the other error logs keep using the regular log-format.
This option will be replaced by a "error-log-format" that enables to use
a dedicated log-format for connection error messages instead of the
regular log-format (in which most of the fields would be invalid in such
a case).
The "log-error-via-logformat" mechanism will then be replaced by a test
on the presence of such an error log format or not. If a format is
defined, it is used for connection error messages, otherwise the legacy
error log format is used.
One was in backend.c and the other one in hlua.c. No other candidate
was found with "git grep '^#if\s*USE'". It's worth noting that 3
other such tests exist for SSL_OP_NO_{SSLv3,TLSv1_1,TLSv1_2} but
that these ones are properly set to 0 in openssl-compat.h when not
defined.
The condition should first check whether `bsize` is reached, before
dereferencing the offset. Even if this always works fine, due to the
string being null-terminated, this certainly looks odd.
Found using GitHub's CodeQL scan.
This bug traces back to at least 97c2ae13bc
(1.7.0+) and this patch should be backported accordingly.
Using localtime / gmtime is not thread-safe, whereas the `get_*` wrappers are.
Found using GitHub's CodeQL scan.
The use in sample_conv_ltime() can be traced back to at least
fac9ccfb70 (first appearing in 1.6-dev3), so all
supported branches with thread support are affected.
The test on FIND_OPTIMAL_MATCH for the experimental code can yield a
build warning when using -Wundef, let's turn it into a regular ifdef.
This is slz upstream commit 05630ae8f22b71022803809eb1e7deb707bb30fb
Before threads were introduced in 1.8, idle_pct used to be a global
variable indicating the overall process idle time. Threads made it
thread-local, meaning that its reporting in the stats made little
sense, though this was not easy to spot. In 2.0, the idle_pct variable
moved to the struct thread_info via commit 81036f273 ("MINOR: time:
move the cpu, mono, and idle time to thread_info"). It made it more
obvious that the idle_pct was per thread, and also allowed to more
accurately measure it. But no more effort was made in that direction.
This patch introduces a new report_idle() function that accurately
averages the per-thread idle time over all running threads (i.e. it
should remain valid even if some threads are paused or stopped), and
makes use of it in the stats / "show info" reports.
Sending traffic over only two connections of an 8-thread process
would previously show this erratic CPU usage pattern:
$ while :; do socat /tmp/sock1 - <<< "show info"|grep ^Idle;sleep 0.1;done
Idle_pct: 30
Idle_pct: 35
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 35
Idle_pct: 33
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Now it shows this more accurate measurement:
$ while :; do socat /tmp/sock1 - <<< "show info"|grep ^Idle;sleep 0.1;done
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
This is not technically a bug but this lack of precision definitely affects
some users who rely on the idle_pct measurement. This should at least be
backported to 2.4, and might be to some older releases depending on users
demand.
To be able to provide JA3 compatible TLS Fingerprints we need to expose
all Client Hello captured data using fetchers. Patch provides new
and modifies existing fetchers to add ability to filter out GREASE values:
- ssl_fc_cipherlist_*
- ssl_fc_ecformats_bin
- ssl_fc_eclist_bin
- ssl_fc_extlist_bin
- ssl_fc_protocol_hello_id
When we set tune.ssl.capture-cipherlist-size to a non-zero value
we are able to capture cipherlist supported by the client. To be able to
provide JA3 compatible TLS fingerprinting we need to capture more
information from Client Hello message:
- SSL Version
- SSL Extensions
- Elliptic Curves
- Elliptic Curve Point Formats
This patch allows HAProxy to capture such information and store it for
later use.
The lua initialization code which creates the Lua mapping of all converters
and sample fetch keywords makes use of strncpy(), and as such can take ages
to start with large values of tune.bufsize because it spends its time zeroing
gigabytes of memory for nothing. A test performed with an extreme value of
16 MB takes roughly 4 seconds, so it's possible that some users with huge
1 MB buffers (e.g. for payload analysis) notice a small startup latency.
However this does not affect config checks since the Lua stack is not yet
started. Let's replace this with strlcpy2().
This should be backported to all supported versions.
When a server is configured with name-resolution, resolvers objects are
created with reference to this server. Thus the server is marked as non
purgeable to prevent its removal at runtime.
This does not need to be backport.
Patch 211c967 ("MINOR: httpclient: add the server to the proxy") broke
the reg-tests that do a "show servers state".
Indeed the servers of the proxies flagged with PR_CAP_INT are dumped in
the output of this CLI command.
This patch fixes the issue par ignoring the PR_CA_INT proxies in the
dump.
Without this fix, the decode function would proceed even when the output
buffer is not large enough, because the padding was not considered. For
example, it would not fail with the input length of 23 and the output
buffer size of 15, even the actual decoded output size is 17.
This patch should be backported to all stable branches that have a
base64urldec() function available.
Relax the condition on "delete server" CLI handler to be able to remove
all servers, even non dynamic, except if they are flagged as non
purgeable.
This change is necessary to extend the use cases for dynamic servers
with reload. It's expected that each dynamic server created via the CLI
is manually commited in the haproxy configuration by the user. Dynamic
servers will be present on reload only if they are present in the
configuration file. This means that non-dynamic servers must be allowed
to be removable at runtime.
The dynamic servers removal reg-test has been updated and renamed to
reflect its purpose. A new test is present to check that non-purgeable
servers cannot be removed.
Mark servers that are referenced by configuration elements as non
purgeable. This includes the following list :
- tracked servers
- servers referenced in a use-server rule
- servers referenced in a sample fetch
In a future patch, it will be possible to remove at runtime every
servers, both static and dynamic. This requires to extend the server
refcount for all instances.
First, refcount manipulation functions have been renamed to better
express the API usage.
* srv_refcount_use -> srv_take
The refcount is always initialize to 1 on the server creation in
new_server. It's also incremented for each check/agent configured on a
server instance.
* free_server -> srv_drop
This decrements the refcount and if null, the server is freed, so code
calling it must not use the server reference after it. As a bonus, this
function now returns the next server instance. This is useful when
calling on the server loop without having to save the next pointer
before each invocation.
In these functions, remove the checks that prevent refcount on
non-dynamic servers. Each reference to "dynamic" in variable/function
naming have been eliminated as well.
A dynamic server may be deleted at runtime at the same moment when the
stats applet is pointing to it. Use the server refcount to prevent
deletion in this case.
This should be backported up to 2.4, with an observability period of 2
weeks. Note that it requires the dynamic server refcounting feature
which has been implemented on 2.5; the following commits are required :
- MINOR: server: implement a refcount for dynamic servers
- BUG/MINOR: server: do not use refcount in free_server in stopping mode
- MINOR: server: return the next srv instance on free_server
As a convenience, return the next server instance from servers list on
free_server.
This is particularily useful when using this function on the servers
list without having to save of the next pointer before calling it.
using the procctl api to set the current process as traceable, thus being able to produce a core dump as well.
making it as compile option if not wished or using freebsd prior to 11.x (last no EOL release).
THe http_update_update_host function takes an URL and extract the domain
to use as a host header. However it only update an existing host header
and does not create one.
This patch add an empty host header so the function can update it.
Add the raw and ssl server to the proxy list so they can be freed during
the deinit() of HAProxy. As a side effect the 2 servers need to have a
different ID so the SSL one was renamed "<HTTPSCLIENT>".
Ensure that no more than olen bytes is written to the output buffer,
otherwise we might experience an unexpected behavior.
While the original code used to validate that the output size was
always large enough before starting to write, this validation was
later broken by the commit below, allowing to 3-byte blocks to
areas whose size is not multiple of 3:
commit ed697e4856
Author: Emeric Brun <ebrun@haproxy.com>
Date: Mon Jan 14 14:38:39 2019 +0100
BUG/MINOR: base64: dec func ignores padding for output size checking
Decode function returns an error even if the ouptut buffer is
large enought because the padding was not considered. This
case was never met with current code base.
For base64urldec(), it's basically the same problem except that since
the input format supports arbitrary lengths, the problem has always
been there since its introduction in 2.4.
This should be backported to all stable branches having a backport of
the patch above (i.e. 2.0), with some adjustments depending on the
availability of the base64dec() and base64urldec().
The httpclient does a free of the servers and proxies it uses, however
since we are including them in the global proxy list, haproxy already
free them during the deinit. We can safely remove these free.
The sc-set-gpt0() parser was extended in 2.1 by commit 0d7712dff ("MINOR:
stick-table: allow sc-set-gpt0 to set value from an expression") to support
sample expressions in addition to plain integers. However there is a
subtlety there, which is that while the arg position must be incremented
when parsing an integer, it must not be touched when calling an expression
since the expression parser already does it.
The effect is that rules making use of sc-set-gpt0() followed by an
expression always ignore one word after that expression, and will typically
fail to parse if followed by an "if" as the parser will restart after the
"if". With no condition it's different because an empty condition doesn't
result in trying to parse anything.
This patch moves the increment at the right place and adds a few
explanations for a code part that was far from being obvious.
This should be backported to branches having the commit above (2.1+).
Implements a way of checking the running openssl version:
If the OpenSSL support was not compiled within HAProxy it will returns a
error, so it's recommanded to do a SSL feature check before:
$ ./haproxy -cc 'feature(OPENSSL) && openssl_version_atleast(0.9.8zh) && openssl_version_before(3.0.0)'
This will allow to select the SSL reg-tests more carefully.
Some users are facing huge CPU usage or even watchdog panics due to
the Lua global lock when many threads compete on it, but they have
no way to see that in the usual dumps. We take the lock at 2 or 3
places only, thus it's trivial to move it to a global function so
that stack dumps will now explicitly show it, increasing the change
that it rings a bell and someone suggests switch to lua-load-per-thread:
Current executing Lua from a stream analyser -- stack traceback:
loop.lua:1: in function line 1
call trace(27):
| 0x5ff157 [48 83 c4 10 5b 5d 41 5c]: wdt_handler+0xf7/0x104
| 0x7fe37fe82690 [48 c7 c0 0f 00 00 00 0f]: libpthread:+0x13690
| 0x614340 [66 48 0f 7e c9 48 01 c2]: main+0x1e8a40
| 0x607b85 [48 83 c4 08 48 89 df 31]: main+0x1dc285
| 0x6070bc [48 8b 44 24 20 48 8b 14]: main+0x1db7bc
| 0x607d37 [41 89 c4 89 44 24 1c 83]: lua_resume+0xc7/0x214
| 0x464ad6 [83 f8 06 0f 87 f1 01 00]: main+0x391d6
| 0x4691a7 [83 f8 06 0f 87 03 20 fc]: main+0x3d8a7
| 0x51dacb [85 c0 74 61 48 8b 5d 20]: sample_process+0x4b/0xf7
| 0x51e55c [48 85 c0 74 3f 64 48 63]: sample_fetch_as_type+0x3c/0x9b
| 0x525613 [48 89 c6 48 85 c0 0f 84]: sess_build_logline+0x2443/0x3cae
| 0x4af0be [4c 63 e8 4c 03 6d 10 4c]: http_apply_redirect_rule+0xbfe/0xdf8
| 0x4af523 [83 f8 01 19 c0 83 e0 03]: main+0x83c23
| 0x4b2326 [83 f8 07 0f 87 99 00 00]: http_process_req_common+0xf6/0x15f6
| 0x4d5b30 [85 c0 0f 85 9f f5 ff ff]: process_stream+0x2010/0x4e18
It also allows "perf top" to directly show the time spent on this lock.
This may be backported to some stable versions as it improves the
overall debuggability.
Include the correct .h files in http_client.c and http_client.h.
The api.h is needed in http_client.c and http_client-t.h is now include
directly from http_client.h
Reported by coverity in ticket #1355
CID 1461505: Memory - illegal accesses (UNINIT)
Using uninitialized value "sl".
Fix the problem by initializing sl to NULL.
Proxies must call proxy_preset_defaults() to initialize their settings
that are usually learned from defaults sections (e.g. connection retries,
pool purge delay etc). At the moment there was likely no impact, but not
doing so could cause trouble soon when using the client more extensively
or when new defaults are introduced and failed to be initialized.
No backport is needed.
Recent commit 83614a9fb ("MINOR: httpclient: initialize the proxy") broke
reg tests that match the output of "show stats" or "show servers state"
because it changed the proxies' numeric ID.
In fact it did nothing wrong, it just registers a proxy and adds it at
the head of the list. But the automatic numbering scheme, which was made
to make sure that temporarily disabled proxies in the config keep their
ID instead of shifting all others, sees one more proxy and increments
next_pxid for all subsequent proxies.
This patch avoids this by not assigning automatic IDs to such internal
proxies, leaving them with their ID of -1, and by not shifting next_pxid
for them. This is important because the user might experience them
appearing or disappearing depending on apparently unrelated config
options or build options, and this must not cause visible proxy IDs
to change (e.g. stats or minitoring may break).
Though the issue has always been there, it only became a problem with
the recent proxy additions so there is no need to backport this.
The X509_STORE_CTX_get0_cert did not exist yet on OpenSSL 1.0.2 and
neither did X509_STORE_CTX_get0_chain, which was not actually needed
since its get1 equivalent already existed.
RFC7540 states that :path follows RFC3986's path-absolute. However
that was a bug introduced in the spec between draft 04 and draft 05
of the spec, which implicitly causes paths starting with "//" to be
forbidden. HTTP/1 (and now HTTP core semantics) made it explicit
that the request-target in origin-form follows a purposely defined
absolute-path defined as 1*(/ segment) to explicitly allow "//".
http2bis now fixes this by relying on absolute-path so that "//"
becomes valid and matches other versions. Full discussion here:
https://lists.w3.org/Archives/Public/ietf-http-wg/2021JulSep/0245.html
This issue appeared in haproxy with commit 4b8852c70 ("BUG/MAJOR: h2:
verify that :path starts with a '/' before concatenating it") when
making the checks on :path fully comply with the spec, and was backported
as far as 2.0, so this fix must be backported there as well to allow
"//" in H2 again.
Most of the SSL sample fetches related to the client certificate were
based on the SSL_get_peer_certificate function which returns NULL when
the verification process failed. This made it impossible to use those
fetches in a log format since they would always be empty.
The patch adds a reference to the X509 object representing the client
certificate in the SSL structure and makes use of this reference in the
fetches.
The reference can only be obtained in ssl_sock_bind_verifycbk which
means that in case of an SSL error occurring before the verification
process ("no shared cipher" for instance, which happens while processing
the Client Hello), we won't ever start the verification process and it
will be impossible to get information about the client certificate.
This patch also allows most of the ssl_c_XXX fetches to return a usable
value in case of connection failure (because of a verification error for
instance) by making the "conn->flags & CO_FL_WAIT_XPRT" test (which
requires a connection to be established) less strict.
Thanks to this patch, a log-format such as the following should return
usable information in case of an error occurring during the verification
process :
log-format "DN=%{+Q}[ssl_c_s_dn] serial=%[ssl_c_serial,hex] \
hash=%[ssl_c_sha1,hex]"
It should answer to GitHub issue #693.
Change the User-Agent from "HAProxy HTTP client" to "HAProxy" as the
previous name is not valid according to RFC 7231#5.5.3.
This patch fixes issue #1354.
This commit implements an HTTP Client over the CLI, this was made as
working example for the HTTP Client API.
It usable over the CLI by specifying a method and an URL:
echo "httpclient GET http://127.0.0.1:8000/demo.file" | socat /tmp/haproxy.sock -
Only IP addresses are accessibles since the API does not allow to
resolve addresses yet.
This commit implements a very simple HTTP Client API.
A client can be operated by several functions:
- httpclient_new(), httpclient_destroy(): create
and destroy the struct httpclient instance.
- httpclient_req_gen(): generate a complete HTX request using the
the absolute URL, the method and a list of headers. This request
is complete and sets the HTX End of Message flag. This is limited
to small request we don't need a body.
- httpclient_start() fill a sockaddr storage with a IP extracted
from the URL (it cannot resolve an fqdm for now), start the
applet. It also stores the ptr of the caller which could be an
appctx or something else.
- hc->ops contains a list of callbacks used by the
HTTPClient, they should be filled manually after an
httpclient_new():
* res_stline(): the client received a start line, its content
will be stored in hc->res.vsn, hc->res.status, hc->res.reason
* res_headers(): the client received headers, they are stored in
hc->res.hdrs.
* res_payload(): the client received some payload data, they are
stored in the hc->res.buf buffer and could be extracted with the
httpclient_res_xfer() function, which takes a destination buffer
as a parameter
* res_end(): this callback is called once we finished to receive
the response.
Initialize a proxy which contain a server for the raw HTTP, and another
one for the HTTPS. This proxy will use the global server log definition
and the 'option httplog' directive.
This proxy is internal and will only be used for the HTTP Client API.
The wording regarding Host vs :authority in RFC7540 is ambiguous as it
says that an intermediary must produce a host header from :authority if
Host is missing, but, contrary to HTTP/1.1, doesn't say anything regarding
the possibility that Host and :authority differ, which leaves Host with
higher precedence there. In addition it mentions that clients should use
:authority *instead* of Host, and that H1->H2 should use :authority only
if the original request was in authority form. This leaves some gray
area in the middle of the chain for fully valid H2 requests arboring a
Host header that are forwarded to the other side where it's possible to
drop the Host header and use the authority only after forwarding to a
second H2 layer, thus possibly seeing two different values of Host at
a different stage. There's no such issue when forwarding from H2 to H1
as the authority is dropped only only the Host is kept.
Note that the following request is sufficient to re-normalize such a
request:
http-request set-header host %[req.hdr(host)]
The new spec in progress (draft-ietf-httpbis-http2bis-03) addresses
this trouble by being a bit is stricter on these rules. It clarifies
that :authority must always be used instead of Host and that Host ought
to be ignored. This is much saner as it avoids to convey two distinct
values along the chain. This becomes the protocol-level equivalent of:
http-request set-uri %[url]
So this patch does exactly this, which we were initially a bit reluctant
to do initially by lack of visibility about other implementations'
expectations. In addition it slightly simplifies the Host header field
creation by always placing it first in the list of headers instead of
last; this could also speed up the look up a little bit.
This needs to be backported to 2.0. Non-HTX versions are safe regarding
this because they drop the URI during the conversion to HTTP/1.1 so
only Host is used and transmitted.
Thanks to Tim Dsterhus for reporting that one.
Before HTX was introduced, all the HTTP request elements passed in
pseudo-headers fields were used to build an HTTP/1 request whose syntax
was then scrutinized by the HTTP/1 parser, leaving no room to inject
invalid characters.
While NUL, CR and LF are properly blocked, it is possible to inject
spaces in the method so that once translated to HTTP/1, fields are
shifted by one spcae, and a lenient HTTP/1 server could possibly be
fooled into using a part of the method as the URI. For example, the
following request:
H2 request
:method: "GET /admin? HTTP/1.1"
:path: "/static/images"
would become:
GET /admin? HTTP/1.1 /static/images HTTP/1.1
It's important to note that the resulting request is *not* valid, and
that in order for this to be a problem, it requires that this request
is delivered to an already vulnerable HTTP/1 server.
A workaround here is to reject malformed methods by placing this rule
in the frontend or backend, at least before leaving haproxy in H1:
http-request reject if { method -m reg [^A-Z0-9] }
Alternately H2 may be globally disabled by commenting out the "alpn"
directive on "bind" lines, and by rejecting H2 streams creation by
adding the following statement to the global section:
tune.h2.max-concurrent-streams 0
This patch adds a check for each character of the method to make sure
they belong to the ones permitted in a token, as mentioned in RFC7231#4.1.
This should be backported to versions 2.0 and above. For older versions
not having HTX_FL_PARSING_ERROR, a "goto fail" works as well as it
results in a protocol error at the stream level. Non-HTX versions are
safe because the resulting invalid request will be rejected by the
internal HTTP/1 parser.
Thanks to Tim Dsterhus for reporting that one.
Tim Dsterhus found that while the H2 path is checked for non-emptiness,
invalid chars and '*', a test is missing to verify that except for '*',
it always starts with exactly one '/'. During the reconstruction of the
full URI when passing to HTX, this missing test allows to affect the
apparent authority by appending a port number or a suffix name.
This only affects H2-to-H2 communications, as H2-to-H1 do not use the
full URI. Like for previous fix, the following rule inserted before
other ones in the frontend is sufficient to renormalize the internal
URI and let haproxy see the same authority as the target server:
http-request set-uri %[url]
This needs to be backported to 2.2. Earlier versions do not rebuild a
full URI using the authority and will fail on the malformed path at the
HTTP layer, so they are safe.
While we do explicitly check for strict character sets in the scheme,
this is only done when extracting URL components from an assembled one,
and we have special handling for "http" and "https" schemes directly in
the H2-to-HTX conversion. Sadly, this lets all other ones pass through
if they start exactly with "http://" or "https://", allowing the
reconstructed URI to start with a different looking authority if it was
part of the scheme.
It's interesting to note that in this case the valid authority is in
the Host header and that the request will only be wrong if emitted over
H2 on the backend side, since H1 will not emit an absolute URI by
default and will drop the scheme. So in essence, this is a variant of
the scheme-based attack described below in that it only affects H2-H2
and not H2-H1 forwarding:
https://portswigger.net/research/http2
As such, a simple workaround consists in just inserting the following
rule before other ones in the frontend, which will have for effect to
renormalize the authority in the request line according to the
concatenated version (making haproxy see the same authority and host
as what the target server will see):
http-request set-uri %[url]
This patch simply adds the missing syntax checks for non-http/https
schemes before the concatenation in the H2 code. An improvement may
consist in the future in splitting these ones apart in the start
line so that only the "url" sample fetch function requires to access
them together and that all other places continue to access them
separately. This will then allow the core code to perform such checks
itself.
The patch needs to be backported as far as 2.2. Before 2.2 the full
URI was not being reconstructed so the scheme and authority part were
always dropped from H2 requests to leave only origin requests. Note
for backporters: this depends on this previous patch:
MINOR: http: add a new function http_validate_scheme() to validate a scheme
Many thanks to Tim Dsterhus for figuring that one and providing a
reproducer.
While http_parse_scheme() extracts a scheme from a URI by extracting
exactly the valid characters and stopping on delimiters, this new
function performs the same on a fixed-size string.
txn functions can now be called from an action or a filter context. Thus the
return code must be adapted depending on this context. From an action, act.ABORT
is returned. From a filter, -1 is returned. It is the filter error code.
This bug only affects 2.5-dev. No backport needed.
CF_FLT_ANALYZE flags may be set before the FLT_END analyser. Thus if an error is
triggered in the mean time, this may block the stream and prevent it to be
released. It is indeed a problem only for the response channel because the
response analysers may be skipped on early errors.
So, to prevent any issue, depending on the code path, the FLT_END analyser is
systematically set when the CF_FLT_ANALYZE flag is set.
This patch must be backported in all stable branches.
The internal proxies should be part of the proxies list, because of
this, the check_config_validity() fonction could emit warnings about
these proxies.
This patch disables 3 startup warnings for internal proxies:
- "has no 'bind' directive" (this one was already ignored for the CLI
frontend, but we made it generic instead)
- "missing timeouts"
- "log format ignored"
User reported that the config check returns an error with the message:
"Configuration file has no error but will not start (no listener) => exit(2)."
if the configuration present only a log-forward section with bind or dgram-bind
listeners but no listen/backend nor peer sections.
The process checked if there was 'peers' section avalaible with
an internal frontend (and so a listener) or a 'listen/backend'
section not disabled with at least one configured listener (into the
global proxies_list). Since the log-forward proxies appear in a
different list, they were not checked.
This patch adds a lookup on the 'log-forward' proxies list to check
if one of them presents a listener and is not disabled. And
this is done only if there was no available listener found into
'listen/backend' sections.
I have also studied how to re-work this check considering the 'listeners'
counter used after startup/init to keep the same algo and avoid further
mistakes but currently this counter seems increased during config parsing
and if a proxy is disabled, decreased during startup/init which is done
after the current config check. So the fix still not rely on this
counter.
This patch should fix the github issue #1346
This patch should be backported as far as 2.3 (so on branches
including the "log-forward" feature)
When a lua filter declaration is parsed, some allocation errors were not
properly handled. In addition, we must be sure the filter identifier is defined
in lua to duplicate it when the filter configuration is filled.
This patch fix a defect reported in the issue #1347. It only concerns
2.5-dev. No backport needed.
In Channel and HTTPMessage classes, several functions uses an offset that
may be negative to start from the end of incoming data. But, after
calculation, the offset must never be negative. However, there is a bug
because of a bad cast to unsigned when "input + offset" is performed. The
result must be a signed integer.
This patch should fix most of defects reported in the issue #1347. It only
affects 2.5-dev. No backport needed.
Now an HTTPMessage class is available to manipulate HTTP message from a filter
it is possible to bind HTTP filters callback function on lua functions. Thus,
following methods may now be defined by a lua filter:
* Filter:http_headers(txn, http_msg)
* Filter:http_payload(txn, http_msg, offset, len)
* Filter:http_end(txn, http_msg)
http_headers() and http_end() may return one of the constant filter.CONTINUE,
filter.WAIT or filter.ERROR. If nothing is returned, filter.CONTINUE is used as
the default value. On its side, http_payload() may return the amount of data to
forward. If nothing is returned, all incoming data are forwarded.
For now, these functions are not allowed to yield because this interferes with
the filter workflow.
When a lua TXN is created from a filter context, the request and the response
HTTP message objects are accessible from ".http_req" and ".http_res" fields. For
an HTTP proxy, these objects are always defined. Otherwise, for a TCP proxy, no
object is created and nil is used instead. From any other context (action or
sample fetch), these fields don't exist.
This new class exposes methods to manipulate HTTP messages from a filter
written in lua. Like for the HTTP class, there is a bunch of methods to
manipulate the message headers. But there are also methods to manipulate the
message payload. This part is similar to what is available in the Channel
class. Thus the payload can be duplicated, erased, modified or
forwarded. For now, only DATA blocks can be retrieved and modified because
the current API is limited. No HTTPMessage method is able to yield. Those
manipulating the headers are always called on messages containing all the
headers, so there is no reason to yield. Those manipulating the payload are
called from the http_payload filters callback function where yielding is
forbidden.
When an HTTPMessage object is instantiated, the underlying Channel object
can be retrieved via the ".channel" field.
For now this class is not used because the HTTP filtering is not supported
yet. It will be the purpose of another commit.
There is no documentation for now.
It is now possible to write some filter callback functions in lua. All
filter callbacks are not supported yet but the mechanism to call them is now
in place. Following method may be defined in the Lua filter class to be
bound on filter callbacks:
* Filter:start_analyse(txn, chn)
* Filter:end_analyse(txn, chn)
* Filter:tcp_payload(txn, chn, offset, length)
hlua_filter_callback() function is responsible to call the good lua function
depending on the filter callback function. Using some flags it is possible
to allow a lua call to yield or not, to retrieve a return value or not, and
to specify if a channel or an http message must be passed as second
argument. For now, the HTTP part has not been added yet. It is also possible
to add extra argument adding them on the stack before the call.
3 new functions are exposed by the global object "filter". The first one,
filter.wake_time(ms_delay), to set the wake_time when a Lua callback
function yields (if allowed). The two others,
filter.register_data_filter(filter, chn) and
filter.unregister_data_filter(filter, chn), to enable or disable the data
filtering on a channel for a specific lua filter instance.
start_analyse() and end_analyse() may return one of the constant
filter.CONTINUE, filter.WAIT or filter.ERROR. If nothing is returned,
filter.CONTINUE is used as the default value. On its side, tcp_payload() may
return the amount of data to forward. If nothing is returned, all incoming
data are forwarded.
For now, these functions are not allowed to yield because this interferes
with the filter workflow.
Here is a simple example :
MyFilter = {}
MyFilter.id = "My Lua filter"
MyFilter.flags = filter.FLT_CFG_FL_HTX
MyFilter.__index = MyFilter
function MyFilter:new()
flt = {}
setmetatable(flt, MyFilter)
flt.req_len = 0
flt.res_len = 0
return flt
end
function MyFilter:start_analyze(txn, chn)
filter.register_data_filter(self, chn)
end
function MyFilter:end_analyze(txn, chn)
print("<Total> request: "..self.req_len.." - response: "..self.res_len)
end
function MyFilter:tcp_payload(txn, chn)
offset = chn:ouput()
len = chn:input()
if chn:is_resp() then
self.res_len = self.res_len + len
print("<TCP:Response> offset: "..offset.." - length: "..len)
else
self.req_len = self.req_len + len
print("<TCP:Request> offset: "..offset.." - length: "..len)
end
end
For filters written in lua, the tcp payloads will be filtered using methods
exposed by the Channel class. So the corrsponding C binding functions must
be prepared to process payload in a filter context and not only in an action
context.
The main change is the offset where to start to process data in the channel
buffer, and the length of these data. For an action, all input data are
considered. But for a filter, it depends on what the filter is allow to
forward when the tcp_payload callback function is called. It depends on
previous calls but also on other filters.
In addition, when the payload is modified by a lua filter, its context must
be updated. Note also that channel functions cannot yield when called from a
filter context.
For now, it is not possible to define callbacks to filter data and the
documentation has not been updated.
A lua TXN can be created when a sample fetch, an action or a filter callback
function is executed. A flag is now used to track the execute context.
Respectively, HLUA_TXN_SMP_CTX, HLUA_TXN_ACT_CTX and HLUA_TXN_FLT_CTX. The
filter flag is not used for now.
For now, there is no support for filters written in lua. So this function,
if called, will always return NULL. But when it will be called in a filter
context, it will return the filter structure attached to a channel
class. This function is also responsible to set the offset of data that may
be processed and the length of these data. When called outside a filter
context (so from an action), the offset is the input data position and the
length is the input data length. From a filter, the offset and the length of
data that may be filtered are retrieved the filter context.
It is now possible to write dummy filters in lua. Only the basis to declare
such filters has been added for now. There is no way to declare callbacks to
filter anything. Lua filters are for now empty nutshells.
To do so, core.register_filter() must be called, with 3 arguments, the
filter's name (as it appears in HAProxy config), the lua class that will be
used to instantiate filters and a function to parse arguments passed on the
filter line in HAProxy configuration file. The lua filter class must at
least define the method new(), without any extra args, to create new
instances when streams are created. If this method is not found, the filter
will be ignored.
Here is a template to declare a new Lua filter:
// haproxy.conf
global
lua-load /path/to/my-filter.lua
...
frontend fe
...
filter lua.my-lua-filter arg1 arg2 arg3
filter lua.my-lua-filter arg4 arg5
// my-filter.lua
MyFilter = {}
MyFilter.id = "My Lua filter" -- the filter ID (optional)
MyFilter.flags = filter.FLT_CFG_FL_HTX -- process HTX streams (optional)
MyFilter.__index = MyFilter
function MyFilter:new()
flt = {}
setmetatable(flt, MyFilter)
-- Set any flt fields. self.args can be used
flt.args = self.args
return flt -- The new instance of Myfilter
end
core.register_filter("my-lua-filter", MyFilter, function(filter, args)
-- process <args>, an array of strings. For instance:
filter.args = args
return filter
end)
In this example, 2 filters are declared using the same lua class. The
parsing function is called for both, with its own copy of the lua class. So
each filter will be unique.
The global object "filter" exposes some constants and flags, and later some
functions, to help writting filters in lua.
Internally, when a lua filter is instantiated (so when new() method is
called), 2 lua contexts are created, one for the request channel and another
for the response channel. It is a prerequisite to let some callbacks yield
on one side independently on the other one.
There is no documentation for now.
First of all, following functions are now considered deprecated:
* Channel:dup()
* Channel:get()
* Channel:getline()
* Channel:get_in_len()
* Cahnnel:get_out_len()
It is just informative, there is no warning and functions may still be
used. Howver it is recommended to use new functions. New functions are more
flexible and use a better naming pattern. In addition, the same names will
be used in the http_msg class to manipulate http messages from lua filters.
The new API is:
* Channel:data()
* Channel:line()
* Channel:append()
* Channel:prepend()
* Channel:insert()
* Channel:remove()
* Channel:set()
* Channel:input()
* Channel:output()
* Channel:send()
* Channel:forward()
* Channel:is_resp()
* Channel:is_full()
* Channel:may_recv()
The lua documentation was updated accordingly.
The main change is that following functions will now process channel's data
using an offset and a length:
* hlua_channel_dup_yield()
* hlua_channel_get_yield()
* hlua_channel_getline_yield()
* hlua_channel_append_yield()
* hlua_channel_set()
* hlua_channel_send_yield()
* hlua_channel_forward_yield()
So for now, the offset is always the input data position and the length is
the input data length. But with the support for filters, from a filter
context, these values will be relative to the filter.
To make all processing clearer, the function _hlua_channel_dup() has been
updated and _hlua_channel_dupline(), _hlua_channel_insert() and
_hlua_channel_delete() have been added.
This patch is mandatory to allow the support of the filters written in lua.
The hlua_checktable() function may now be used to create and return a
reference on a table in stack, given its position. This function ensures it
is really a table and throws an exception if not.
This patch is mandatory to allow the support of the filters written in lua.
Lua functions to set or append data to the input part of a channel must not
yield because new data may be received while the lua script is suspended. So
adding data to the input part in several passes is highly unpredicatble and
may be interleaved with received data.
Note that if necessary, it is still possible to suspend a lua action by
returning act.YIELD. This way the whole action will be reexecuted later
because of I/O events or a timer. Another solution is to call core.yield().
This bug affects all stable versions. So, it may be backported. But it is
probably not necessary because nobody notice it till now.
When a script is executed, it is not always allowed to yield. Lua sample
fetches and converters cannot yield. For lua actions, it depends on the
context. When called from tcp content ruleset, an action may yield until the
expiration of the inspect-delay timeout. From http rulesets, yield is not
possible.
Thus, when channel functions (dup, get, append, send...) are called, instead
of yielding when it is not allowed and triggering an error, we just give
up. In this case, some functions do nothing (dup, append...), some others
just interrupt the in-progress job (send, forward...). But, because these
functions don't yield anymore when it is not allowed, the script regains the
control and can continue its execution.
This patch depends on "MINOR: lua: Add a flag on lua context to know the
yield capability at run time". Both may be backported in all stable
versions. However, because nobody notice this bug till now, it is probably
not necessary, excepted if someone ask for it.
When a script is executed, a flag is used to allow it to yield. An error is
returned if a lua function yield, explicitly or not. But there is no way to
get this capability in C functions. So there is no way to choose to yield or
not depending on this capability.
To fill this gap, the flag HLUA_NOYIELD is introduced and added on the lua
context if the current script execution is not authorized to yield. Macros
to set, clear and test this flags are also added.
This feature will be usefull to fix some bugs in lua actions execution.
When at least one filter is registered on a stream, the FLT_END analyzer is
called on both direction when all other analyzers have finished their
processing. During this step, filters may release any allocated elements if
necessary. So it is important to not skip it.
Unfortunately, if both stream interfaces are closed, it is possible to not
wait the end of this analyzer. It is possible to be in this situation if a
filter must wait and prevents the analyzer completion. To fix the bug, we
now wait FLT_END analyzer is no longer registered on both direction before
releasing the stream.
This patch may be backported as far as 1.7, but AFAIK, no filter is affected
by this bug. So the backport seems to be optional for now. In any case, it
should remain under observation for some weeks first.
In tcpcheck_eval_send(), the condition to detect there are still pending
data in the output buffer is buggy. Presence of raw data must be tested for
TCP connection only. But a condition on the connection was missing to be
sure it is not an HTX connection.
This patch must be backported as far as 2.2.
The formatting of the buffer_dump() output must be calculated using the
relative counter, not the absolute one, or everything will be broken if
the <from> variable is not a multiple of 16.
Could be backported in all maintained versions.
A static server is able to support simultaneously both health chech and
agent-check. Adjust the dynamic server CLI handlers to also support this
configuration.
This should not be backported, unless dynamic server checks are
backported.
There is currently a leak on agent-check for dynamic servers. When
deleted, the check rules and vars are not liberated. This leak grows
each time a dynamic server with agent-check is deleted.
Replace the manual purge code by a free_check invocation which
centralizes all the details on check cleaning.
There is no leak for health check because in this case the proxy is the
owner of the check vars and rules.
This should not be backported, unless dynamic server checks are
backported.
If an error occured during a dynamic server creation, free_check is used
to liberate a possible agent-check. However, this does not free
associated vars and rules associated as this is done on another function
named deinit_srv_agent_check.
To simplify the check free and avoid a leak, move free vars/rules in
free_check. This is valid because deinit_srv_agent_check also uses
free_check.
This operation is done only for an agent-check because for a health
check, the proxy instance is the owner of check vars/rules.
This should not be backported, unless dynamic server checks are
backported.
Do not reset check flags when setting CHK_ST_PURGE.
Currently, this change has no impact. However, it is semantically wrong
to clear important flags such as CHK_ST_AGENT on purge.
Furthermore, this change will become mandatoy for a future fix to
properly free agent checks on dynamic servers removal. For this, it will
be needed to differentiate health/agent-check on purge via CHK_ST_AGENT
to properly free agent checks.
This must not be backported unless dynamic servers checks are
backported.
Currently there is a leak at process shutdown with dynamic servers with
check/agent-check activated. Check purges are not executed on process
stopping, so the server is not liberated due to its refcount.
The solution is simply to ignore the refcount on process stopping mode
and free the server on the first free_server invocation.
This should not be backported, unless dynamic server checks are
backported. In this case, the following commit must be backported first.
7afa5c1843
MINOR: global: define MODE_STOPPING
Test if server is not null before using free_server in the check purge
operation. Currently, the null server scenario should not occured as
purge is used with refcounted dynamic servers. However, this might not
be always the case if purge is use in the future in other cases; thus
the test is useful for extensibility.
No need to backport, unless dynamic server checks are backported.
This has been reported through a coverity report in github issue #1343.
This commit is the counterpart for agent check of
"MEDIUM: server: implement check for dynamic servers".
The "agent-check" keyword is enabled for dynamic servers. The agent
check must manually be activated via "enable agent" CLI. This can
enable the dynamic server if the agent response is "ready" without an
explicit "enable server" CLI.
Implement check support for dynamic servers. The "check" keyword is now
enabled for dynamic servers. If used, the server check is initialized
and the check task started in the "add server" CLI handler. The check is
explicitely disabled and must be manually activated via "enable health"
CLI handler.
The dynamic server refcount is incremented if a check is configured. On
"delete server" handler, the check is purged, which decrements the
refcount.
Implement a collection of keywords deemed safe and useful to dynamic
servers. The list of the supported keywords is :
- addr
- check-proto
- check-send-proxy
- check-via-socks4
- rise
- fall
- fastinter
- downinter
- port
- agent-addr
- agent-inter
- agent-port
- agent-send
Implement a mechanism to free a started check on runtime for dynamic
servers. A new function check_purge is created for this. The check task
will be marked for deletion and scheduled to properly close connection
elements and free the task/tasklet/buf_wait elements.
This function will be useful to delete a dynamic server wich checks.
It is necessary to have a refcount mechanism on dynamic servers to be
able to enable check support. Indeed, when deleting a dynamic server
with check activated, the check will be asynchronously removed. This is
mandatory to properly free the check resources in a thread-safe manner.
The server instance must be kept alive for this.
global maxsock is used to estimate a number of fd to reserve for
internal use, such as checks. It is incremented at startup with the info
from the config file.
Disable this incrementation in checks functions at runtime. First, it
currently serves no purpose to increment it after startup. Worse, it may
lead to out-of-bound accesse on the fdtab.
This will be useful to initiate checks for dynamic servers.
Remove static qualifier on init_srv_check, init_srv_agent_check and
start_check_task. These functions will be called in server.c for dynamic
servers with checks.
Allocate default tcp ruleset for every backend without explicit rules
defined, even if no server in the backend use check. This change is
required to implement checks for dynamic servers.
This allocation is done on check_config_validity. It must absolutely be
called before check_proxy_tcpcheck (called via post proxy check) which
allocate the implicit tcp connect rule.
Implement an equivalent of task_kill for tasklets. This function can be
used to request a tasklet deletion in a thread-safe way.
Currently this function is unused.
Remove the "DEPRECATED" marker on "enable/disable health/agent"
commands. Their purpose is to toggle the check/agent on a server.
These commands are still useful because their purpose is not covered by
the "set server" command. Most there was confusion with the commands
'set server health/agent', which in fact serves another goal.
Note that the indication "use 'set server' instead" has been added since
2016 on the commit
2c04eda8b5
REORG: cli: move "{enable|disable} health" to server.c
and
58d9cb7d22
REORG: cli: move "{enable|disable} agent" to server.c
Besides, these commands will become required to enable check/agent on
dynamic servers which will be created with check disabled.
This should be backported up to 2.4.
It is the second part of the fix that should solve fairness issues with the
connections management inside the SPOE filter. Indeed, in multithreaded
mode, when the SPOE detects there are some connections in queue on a server,
it closes existing connections by releasing SPOE applets. It is mandatory
when a maxconn is set because few connections on a thread may prenvent new
connections establishment.
The first attempt to fix this bug (9e647e5af "BUG/MEDIUM: spoe: Kill applets
if there are pending connections and nbthread > 1") introduced a bug. In
pipelining mode, SPOE applets might be closed while some frames are pending
for the ACK reply. To fix the bug, in the processing stage, if there are
some connections in queue, only truly idle applets may process pending
requests. In this case, only one request at a time is processed. And at the
end of the processing stage, only truly idle applets may be released. It is
an empirical workaround, but it should be good enough to solve contention
issues when a low maxconn is set.
This patch should partely fix the issue #1340. It must be backported as far
as 2.0.
On a thread, when the last SPOE applet is released, if there are still
pending streams, a new one is created. Of course, HAproxy must not be
stopping. It is important to start a new applet in this case to not abort
in-progress jobs, especially when a maxconn is set. Because applets may be
closed to be fair with connections waiting for a free slot.
This patch should partely fix the issue #1340. It depends on the commit
"MINOR: spoe: Create a SPOE applet if necessary when the last one on a
thread is closed". Both must be backported as far as 2.0.
There was no way to access the SPOE filter configuration from the agent
object. However it could be handy to have it. And in fact, this will be
required to fix a bug.
Nenad noticed that when leaving maintenance, the servers' last_change
field was not updated. This is visible in the Status column of the stats
page in front of the state, as the cumuled time spent in the current state
is wrong, it starts from the last transition (typically ready->maint). In
addition, the backend's state was not updated either, because the down
transition is performed by set_backend_down() which also emits a log, and
it is this function which was extended to update the backend's last_change,
but it's not called for down->up transitions so that was not done.
The most visible (and unpleasant) effect of this bug is that it affects
slowstart so such a server could immediately restart with a significant
load ratio.
This should likely be backported to all stable releases.
Right now we're using a DWCAS to atomically set the running_mask while
being constrained by the thread_mask. This DWCAS is annoying because we
may seriously need it later when adding support for thread groups, for
checking that the running_mask applies to the correct group.
It turns out that the DWCAS is not strictly necessary because we never
need it to set the thread_mask based on the running_mask, only the other
way around. And in fact, the running_mask is always cleared alone, and
the thread_mask is changed alone as well. The running_mask is only
relevant to indicate a takeover when the thread_mask matches it. Any
bit set in running and not present in thread_mask indicates a transition
in progress.
As such, it is possible to re-arrange this by using a regular CAS around a
consistency check between running_mask and thread_mask in fd_update_events
and by making a CAS on running_mask then an atomic store on the thread_mask
in fd_takeover(). The only other case is fd_delete() but that one already
sets the running_mask before clearing the thread_mask, which is compatible
with the consistency check above.
This change has happily survived 10 billion takeovers on a 16-thread
machine at 800k requests/s.
The fd-migration doc was updated to reflect this change.
This one is set whenever an FD is reported by a poller with a null owner,
regardless of the thread_mask. It has become totally meaningless because
it only indicates a migrated FD that was not yet reassigned to a thread,
but as soon as a thread uses it, the status will change to skip_fd. Thus
there is no reason to distinguish between the two, it adds more confusion
than it helps. Let's simply drop it.
If an error occured during the CLI 'add server' handler, the newly
created server must be removed from the proxy list if already inserted.
Currently, this can happen on the extremely rare error during server id
generation if there is no id left.
The removal operation is not thread-safe, it must be conducted before
releasing the thread isolation.
This can be backported up to 2.4. Please note that dynamic server track
is not implemented in 2.4, so the release_server_track invocation must
be removed for the backport to prevent a compilation error.
In 2.4, runtime server deletion was brought by commit e558043e1 ("MINOR:
server: implement delete server cli command"). A comment remained in the
code about a theoretical race between the thread_isolate() call and another
thread being in the process of allocating memory before accessing the
server via a reference that was grabbed before the memory allocation,
since the thread_harmless_now()/thread_harmless_end() pair around mmap()
may have the effect of allowing cli_parse_delete_server() to proceed.
Now that the full thread isolation is available, let's update the code
to rely on this. Now it is guaranteed that competing threads will either
be in the poller or queued in front of thread_isolate_full().
This may be backported to 2.4 if any report of breakage suggests the bug
really exists, in which case the two following patches will also be
needed:
MINOR: threads: make thread_release() not wait for other ones to complete
MEDIUM: threads: add a stronger thread_isolate_full() call
The current principle of running under isolation was made to access
sensitive data while being certain that no other thread was using them
in parallel, without necessarily having to place locks everywhere. The
main use case are "show sess" and "show fd" which run over long chains
of pointers.
The thread_isolate() call relies on the "harmless" bit that indicates
for a given thread that it's not currently doing such sensitive things,
which is advertised using thread_harmless_now() and which ends usings
thread_harmless_end(), which also waits for possibly concurrent threads
to complete their work if they took this opportunity for starting
something tricky.
As some system calls were notoriously slow (e.g. mmap()), a bunch of
thread_harmless_now() / thread_harmless_end() were placed around them
to let waiting threads do their work while such other threads were not
able to modify memory contents.
But this is not sufficient for performing memory modifications. One such
example is the server deletion code. By modifying memory, it not only
requires that other threads are not playing with it, but are not either
in the process of touching it. The fact that a pool_alloc() or pool_free()
on some structure may call thread_harmless_now() and let another thread
start to release the same object's memory is not acceptable.
This patch introduces the concept of "idle threads". Threads entering
the polling loop are idle, as well as those that are waiting for all
others to become idle via the new function thread_isolate_full(). Once
thread_isolate_full() is granted, the thread is not idle anymore, and
it is released using thread_release() just like regular isolation. Its
users have to keep in mind that across this call nothing is granted as
another thread might have performed shared memory modifications. But
such users are extremely rare and are actually expecting this from their
peers as well.
Note that that in case of backport, this patch depends on previous patch:
MINOR: threads: make thread_release() not wait for other ones to complete
The original intent of making thread_release() wait for other requesters to
proceed was more of a fairness trade, guaranteeing that a thread that was
granted an access to the CPU would be in turn giving back once its job is
done. But this is counter-productive as it forces such threads to spin
instead of going back to the poller, and it prevents us from implementing
multiple levels of guarantees, as a thread_release() call could spin
waiting for another requester to pass while that requester expects
stronger guarantees than the current thread may be able to offer.
Let's just remove that wait period and let the thread go back to the
poller, a-la "race to idle".
While in theory it could possibly slightly increase the perceived
latency of concurrent slow operations like "show fd" or "show sess",
it is not the case at all in tests, probably because the time needed
to reach the poller remains extremely low anyway.
Probably due to a copy-paste, there were two indent levels in this function
since its introduction in 1.9 by commit 60b639ccb ("MEDIUM: hathreads:
implement a more flexible rendez-vous point"). Let's fix this.
If an error occurs during a dynamic server creation with tracking, it
must be removed from the tracked list. This operation is not thread-safe
and thus must be conducted under the thread isolation.
Track support for dynamic servers has been introduced in this release.
This does not need to be backported.
Previous patch b5c0d65 ("MINOR: proxy: disabled takes a stopping and a
disabled state") allows us to set 2 states for a stopped or a disabled
proxy. With this patch we are now able to show the stats of all proxies
when the process is in a stopping states, not only when there is some
activity on a proxy.
This patch should fix issue #1307.
This patch splits the disabled state of a proxy into a PR_DISABLED and a
PR_STOPPED state.
The first one is set when the proxy is disabled in the configuration
file, and the second one is set upon a stop_proxy().
Rename the 'dontloglegacyconnerr' option to 'log-error-via-logformat'
which is much more self-explanatory and readable.
Note: only legacy keywords don't use hyphens, it is recommended to
separate words with them in new keywords.
update_freq_ctr_period() was using relaxed atomics without using barriers,
which usually works fine on x86 but not everywhere else. In addition, some
values were read without being enclosed by barriers, allowing the compiler
to possibly prefetch them a bit earlier. Finally, freq_ctr_total() was also
reading these without enough barriers. Let's make explicit use of atomic
loads and atomic stores to get rid of this situation. This required to
slightly rearrange the freq_ctr_total() loop, which could possibly slightly
improve performance under extreme contention by avoiding to reread all
fields.
A backport may be done to 2.4 if a problem is encountered, but last tests
on arm64 with LSE didn't show any issue so this can possibly stay as-is.
This function already performs a number of checks prior to calling the
IOCB, and detects the change of thread (FD migration). Half of the
controls are still in each poller, and these pollers also maintain
activity counters for various cases.
Note that the unreliable test on thread_mask was removed so that only
the one performed by fd_set_running() is now used, since this one is
reliable.
Let's centralize all that fd-specific logic into the function and make
it return a status among:
FD_UPDT_DONE, // update done, nothing else to be done
FD_UPDT_DEAD, // FD was already dead, ignore it
FD_UPDT_CLOSED, // FD was closed
FD_UPDT_MIGRATED, // FD was migrated, ignore it now
Some pollers already used to call it last and have nothing to do after
it, regardless of the result. epoll has to delete the FD in case a
migration is detected. Overall this removes more code than it adds.
If an MT-aware poller reports that a file descriptor was migrated, it
must stop reporting it. The simplest way to do this is to program an
update if not done yet. This will automatically mark the FD for update
on next round. Otherwise there's a risk that some events are reported
a bit too often and cause extra CPU usage with these pollers. Note
that epoll is currently OK regarding this. Select does not need this
because it uses a single shared events table, so in case of migration
no FD change is expected.
This should be backported as far as 2.2.
The skip_fd counter that is incremented when a migrated FD is reported
was abnormally high in with poll. The reason is that it was accounted
for before preparing the polled events instead of being measured from
the reported events.
This mistake was done when the counters were introduced in 1.9 with
commit d80cb4ee1 ("MINOR: global: add some global activity counters to
help debugging"). It may be backported as far as 2.0.
In 1.8, commit ab62f5195 ("MINOR: polling: Use fd_update_events to update
events seen for a fd") updated the pollers to rely on fd_update_events(),
but the modification delayed the test of presence of the FD in the report,
resulting in owner/thread_mask and possibly event updates being performed
for each FD appearing in a block of 32 FDs around an active one. This
caused the request rate to be ~3 times lower with select() than poll()
under 6 threads.
This can be backported as far as 1.8.
A bug was introduced in 2.1-dev2 by commit 305d5ab46 ("MAJOR: fd: Get
rid of the fd cache."). Pollers "poll" and "evport" had the sleeping
bit accidentally removed before the syscall instead of after. This
results in them not being woken up by inter-thread wakeups, which is
particularly visible with the multi-queue accept() and with queues.
As a work-around, when these pollers are used, "nbthread 1" should
be used.
The fact that it has remained broken for 2 years is a great indication
that threads are definitely not enabled outside of epoll and kqueue,
hence why this patch is only tagged medium.
This must be backported as far as 2.2.
In case of connection failure, a dedicated error message is output,
following the format described in section "Error log format" of the
documentation. These messages cannot be configured through a log-format
option.
This patch adds a new option, "dontloglegacyconnerr", that disables
those error logs when set, and "replaces" them by a regular log line
that follows the configured log-format (thanks to a call to sess_log in
session_kill_embryonic).
The new fc_conn_err sample fetch allows to add the legacy error log
information into a regular log format.
This new option is unset by default so the logging logic will remain the
same until this new option is used.
This new sample fetch along the ssl_fc_hsk_err_str fetch contain the
last SSL error of the error stack that occurred during the SSL
handshake (from the frontend's perspective). The errors happening during
the client's certificate verification will still be given by the
ssl_c_err and ssl_c_ca_err fetches. This new fetch will only hold errors
retrieved by the OpenSSL ERR_get_error function.
The ssl_c_err, ssl_c_ca_err and ssl_c_ca_err_depth sample fetches values
were not recoverable when the connection failed because of the test
"conn->flags & CO_FL_WAIT_XPRT" (which required the connection to be
established). They could then not be used in a log-format since whenever
they would have sent a non-null value, the value fetching was disabled.
This patch ensures that all these values can be fetched in case of
connection failure.
The fc_conn_err and fc_conn_err_str sample fetches give information
about the problem that made the connection fail. This information would
previously only have been given by the error log messages meaning that
thanks to these fetches, the error log can now be included in a custom
log format. The log strings were all found in the conn_err_code_str
function.
Cleanup the mworker_cli_proxy_create() function by removing the
allocation and init of the proxy which is done manually, and replace it
by alloc_new_proxy(). Do the same with the free_proxy() function.
This patch also move the insertion at the end of the function.
Disable the output of the statistics of internal proxies (PR_CAP_INT),
wo we don't rely only on the px->uuid > 0. This will allow to hide more
cleanly the internal proxies in the stats.
This patch renames the proxy capability "LUA" to "INT" so it could be
used for any internal proxy.
Every proxy that are not user defined should use this flag.
This part was fixed several times since commit aade4edc1 ("BUG/MEDIUM:
mux-h2: Don't handle pending read0 too early on streams") and there are
still some cases where a read0 event may be ignored because a partial frame
inhibits the event.
Here, we must take care to set H2_CF_END_REACHED flag if a read0 was
received while a partial frame header is received or if the padding length
is missing.
To ease partial frame detection, H2_CF_DEM_SHORT_READ flag is introduced. It
is systematically removed when some data are received and is set when a
partial frame is found or when dbuf buffer is empty. At the end of the
demux, if the connection must be closed ASAP or if data are missing to move
forward, we may acknowledge the pending read0 event, if any. For now,
H2_CF_DEM_SHORT_READ is not part of H2_CF_DEM_BLOCK_ANY mask.
This patch should fix the issue #1328. It must be backported as far as 2.0.
The splicing does not work anymore because the H1 connection is not swap to
splice mode when rcv_pipe() callback function is called. It is important to
set H1C_F_WANT_SPLICE flag to inhibit data receipt via the buffer
API. Otherwise, because there are always data in the buffer, it is not
possible to use the kernel splicing.
This bug was introduced by the commit 2b861bf72 ("MINOR: mux-h1: clean up
conditions to enabled and disabled splicing").
The patch must be backported to 2.4.
If a connection is closed during the preface while no data are received, if
the dontlognull option is set, no log message must be emitted. However, this
will still be handled as a protocol error. Only the log is omitted.
This patch should fix the issue #1336 for H2 sessions. It must be backported
to 2.4 and 2.3 at least, and probably as far as 2.0.
If a H1 connection is closed while no data are received, if the dontlognull
option is set, no log message must be emitted. Because the H1 multiplexer
handles early errors, it must take care to obey this option. It is true for
400-Bad-Request, 408-Request-Time-out and 501-Not-Implemented
responses. 500-Internal-Server-Error responses are still logged.
This patch should fix the issue #1336 for H1 sessions. It must be backported
to 2.4.
Use non-checked function to retrieve listener/server via obj_type. This
is done as a previous obj_type function ensure that the type is well
known and the instance is not NULL.
Incidentally, this should prevent the coverity report from the #1335
github issue which warns about a possible NULL dereference.
When we evaluate a DNS response item, it may be necessary to look for a
server with a hostname matching the item target into the named servers
tree. To do so, the item target is transformed to a lowercase string. It
must be a null-terminated string. Thus we must explicitly set the trailing
'\0' character.
For a specific resolution, the named servers tree contains all servers using
this resolution with a hostname loaded from a state file. Because of this
bug, same entry may be duplicated because we are unable to find the right
server, assigning this way the item to a free server slot.
This patch should fix the issue #1333. It must be backported as far as 2.2.
Commit 048368ef6 ("MINOR: deinit: always deinit the init_mutex on
failed initialization") added the missing unlock but forgot to
condition it on USE_THREAD, resulting in a build failure. No
backport is needed.
This addresses oss-fuzz issue 36426.
A config like the below fails to validate because of a bogus test:
backend b1
tcp-check connect port 1234
option tcp-check
server s1 1.2.3.4 check
[ALERT] (18887) : config : config: proxy 'b1': server 's1' has neither
service port nor check port, and a tcp_check rule
'connect' with no port information.
A || instead of a && only validates the connect rule when both the
address *and* the port are set. A work around is to set the rule like
this:
tcp-check connect addr 0:1234 port 1234
This needs to be backported as far as 2.2 (2.0 is OK).
Agent stats were lost during the stats refactoring performed in the 2.4 to
simplify the Prometheus exporter. stats_fill_sv_stats() function must fill
ST_F_AGENT_* and ST_F_LAST_AGT stats.
This patch should fix the issue #1331. It must be backported to 2.4.
Some ssl samples cause a segfault when the stream is not instantiated,
for example during an invalid HTTP request. A new check is added to
prevent the stream dereferencing if NULL.
This is the list of the affected samples :
- ssl_s_chain_der
- ssl_s_der
- ssl_s_i_dn
- ssl_s_key_alg
- ssl_s_notafter
- ssl_s_notbefore
- ssl_s_s_dn
- ssl_s_serial
- ssl_s_sha1
- ssl_s_sig_alg
- ssl_s_version
This bug can be reproduced easily by using one of these samples in a
log-format string. Emit an invalid HTTP request with an HTTP client to
trigger the crash.
This bug has been reported in redmine issue 3913.
This must be backported up to 2.2.
This undocumented variable is only for internal use, and its sole
presence affects the process' behavior, as shown in bug #1324. It must
not be exported to workers, external checks, nor programs. Let's unset
it before forking programs and workers.
This should be backported as far as 1.8. The worker code might differ
a bit before 2.5 due to the recent removal of multi-process support.
The master-worker code registers an exit handler to deal with configuration
issues during reload, leading to a restart of the master process in wait
mode. But it shouldn't do that when it's expected that the program stops
during config parsing or condition checks, as the reload operation is
unexpectedly called and results in abnormal behavior and even crashes:
$ HAPROXY_MWORKER_REEXEC=1 ./haproxy -W -c -f /dev/null
Configuration file is valid
[NOTICE] (18418) : haproxy version is 2.5-dev2-ee2420-6
[NOTICE] (18418) : path to executable is ./haproxy
[WARNING] (18418) : config : Reexecuting Master process in waitpid mode
Segmentation fault
$ HAPROXY_MWORKER_REEXEC=1 ./haproxy -W -cc 1
[NOTICE] (18412) : haproxy version is 2.5-dev2-ee2420-6
[NOTICE] (18412) : path to executable is ./haproxy
[WARNING] (18412) : config : Reexecuting Master process in waitpid mode
[WARNING] (18412) : config : Reexecuting Master process
Note that the presence of this variable happens by accident when haproxy
is called from within its own programs (see issue #1324), but this should
be the object of a separate fix.
This patch fixes this by preventing the atexit registration in such
situations. This should be backported as far as 1.8. MODE_CHECK_CONDITION
has to be dropped for versions prior to 2.5.
Oss-fuzz reports in issue 36328 that we can recurse too far by passing
extremely deep expressions to the ".if" parser. I thought we were still
limited to the 1024 chars per line, that would be highly sufficient, but
we don't have any limit now :-/
Let's just pass a maximum recursion counter to the recursive parsers.
It's decremented for each call and the expression fails if it reaches
zero. On the most complex paths it can add 3 levels per parenthesis,
so with a limit of 1024, that's roughly 343 nested sub-expressions that
are supported in the worst case. That's more than sufficient, for just
a few kB of RAM.
No backport is needed.
The init_mutex was not unlocked in case an error is encountered during
a thread initialization, and the polling loop was aborted during startup.
In practise it does not have any observable effect since an explicit
exit() is placed there, but it could confuse some debugging tools or
some static analysers, so let's release it as expected.
This addresses issue #1326.
Since last change on HTTP analysers (252412316 "MEDIUM: proxy: remove
long-broken 'option http_proxy'"), http_process_request() may only return
internal errors on failures. Thus the label used to handle bad requests may
be removed.
This patch should fix the issue #1330.
This option had always been broken in HTX, which means that the first
breakage appeared in 1.9, that it was broken by default in 2.0 and that
no workaround existed starting with 2.1. The way this option works is
praticularly unfit to the rest of the configuration and to the internal
architecture. It had some uses when it was introduced 14 years ago but
nowadays it's possible to do much better and more reliable using a
set of "http-request set-dst" and "http-request set-uri" rules, which
additionally are compatible with DNS resolution (via do-resolve) and
are not exclusive to normal load balancing. The "option-http_proxy"
example config file was updated to reflect this.
The option is still parsed so that an error message gives hints about
what to look for.
The cfg_free_cond_{term,and,expr}() functions used to take a pointer to
the pointer to be freed in order to replace it with a NULL once done.
But this doesn't cope well with freeing lists as it would require
recursion which the current code tried to avoid.
Let's just change the API to free the area and let the caller set the NULL.
This leak was reported by oss-fuzz (issue 36265).
While we do free the array containing the arguments, we do not free
allocated ones. Most of them are unresolved, but strings are allocated
and have to be freed as well. Note that for the sake of not breaking
the args resolution list that might have been set, we still refrain
from doing this if a resolution was already programmed, but for most
common cases (including the ones that can be found in config conditions
and at run time) we're safe.
This may be backported to stable branches, but it relies on the new
free_args() function that was introduced by commit ab213a5b6 ("MINOR:
arg: add a free_args() function to free an args array"), and which is
likely safe to backport as well.
This leak was reported by oss-fuzz (issue 36265).
The removal for the shared inter-process cache in commit 6fd0450b4
("CLEANUP: shctx: remove the different inter-process locking techniques")
accidentally removed the enforcement of rlimit_memmax_all which
corresponds to what is passed to the command-line "-m" argument.
Let's restore it.
Thanks to @nafets227 for spotting this. This fixes github issue #1319.
Now it's possible to form a term using parenthesis around an expression.
This will soon allow to build more complex expressions. For now they're
still pretty limited but parenthesis do work.
Now evaluating a condition will rely on an expression (or an empty string),
and this expression will support ORing a sub-expression with another
optional expression. The sub-expressions ANDs a term with another optional
sub-expression. With this alone precedence between && and || is respected,
and the following expression:
A && B && C || D || E && F || G
will naturally evaluate as:
(A && B && C) || D || (E && F) || G
It's not convenient to let the caller be responsible for node allocation,
better have the leaf function do that and implement the accompanying free
call. Now only a pointer is needed instead of a struct, and the leaf
function makes sure to leave the situation in a consistent way.
Till now we were dealing with single-word expressions but in order to
extend the configuration condition language a bit more, we'll need to
support slightly more complex expressions involving operators, and we
must absolutely support spaces around them to keep them readable.
As all arguments are pointers to the same line with spaces replaced by
zeroes, we can trivially rebuild the whole line before calling the
condition evaluator, and remove the test for extraneous argument. This
is what this patch does.
Random characters placed after a configuration predicate currently do
not report an error. This is a problem because extra parenthesis,
commas or even other random left-over chars may accidently appear there.
Let's now report an error when this happens.
This is marked MEDIUM because it may break otherwise working configs
which are faulty.
The purpose is to build a descendent parser that will split conditions
into expressions made of terms. There are two phases, a parsing phase
and an evaluation phase. Strictly speaking it's not required to cut
that in two right now, but it's likely that in the future we won't want
certain predicates to be evaluated during the parsing (e.g. file system
checks or execution of some external commands).
The cfg_eval_condition() function is now much simpler, it just tries to
parse a single term, and if OK evaluates it, then returns the result.
Errors are unchanged and may still be reported during parsing or
evaluation.
It's worth noting that some invalid expressions such as streq(a,b)zzz
continue to parse correctly for now (what remains after the parenthesis
is simply ignored as not necessary).
The .if/.else/.endif and condition evaluation code is quite dirty and
was dumped into cfgparse.c because it was easy. But it should be tidied
quite a bit as it will need to evolve.
Let's move all that to cfgcond.{c,h}.
Argument arrays used in hlua_lua2arg_check() as well as in the functions
used to call sample fetches and converters were manually released, let's
use the cleaner and more reliable free_args() instead. The prototype of
hlua_lua2arg_check() was amended to mention that the function relies on
the final ARGT_STOP, which is already the case, and the pointless test
for this was removed.
make_arg_list() can create an array of arguments, some of which remain
to be resolved, but all users had to deal with their own roll back on
error. Let's add a free_args() function to release all the array's
elements and let the caller deal with the array itself (sometimes it's
allocated in the stack).
I found myself a few times testing some conditoin examples from the doc
against command line's "-cc" to see that they didn't work with environment
variables expansion. Not being documented as being on purpose it looks like
a miss, so let's add PARSE_OPT_ENV and PARSE_OPT_WORD_EXPAND to be able to
test for example -cc "streq(${WITH_SSL},yes)" to help debug expressions.
This adds the exact same restriction as commit 5546c8bdc ("MINOR:
cfgparse: Fail when encountering extra arguments in macro") but for
the "-cc" command line argument, for the sake of consistency.
Allow the usage of the 'track' keyword for dynamic servers. On server
deletion, the server is properly removed from the tracking chain to
prevents NULL pointer dereferencing.
Prevents the use of the "track" keyword for a dynamic server. This
simplifies the deletion of a dynamic server, without having to worry
about servers which might tracked it.
A BUG_ON is present in the dynamic server delete function to validate
this assertion.
TCC doesn't have the equivalent of __builtin_unreachable() and complains
that hlua_panic_ljmp() may return no value. Let's add a return 0 there.
All compilers that know that longjmp() doesn't return will see no change
and tcc will be happy.
Modern compilers love to break existing code, and some options detected
at build time (such as -fwrapv) are absolutely critical otherwise some
bad code can be generated.
Given that some users rely on packages that force CFLAGS without being
aware of this and can be hit by runtime bugs, we have to help packagers
figure that they need to be careful about their build options.
The test here consists in detecting correct wrapping of signed integers.
Some of the old code relies on it, and modern compilers recently decided
to break it. It's normally addressed using -fwrapv which users will
rarely enforce in their own flags. Thus it is a good indicator of missing
critical CFLAGS, and it happens to be very easy to detect at run time.
Note that the test uses argc in order to have a variable. While gcc
ignores wrapping even for constants, clang only ignores it for variables.
The way the code is constructed doesn't result in code being emitted for
optimized builds thanks to value range propagation.
This should address GitHub issue #1315, and should be backported to all
stable versions. It may result in instantly breaking binaries that seemed
to work fine (typically the ones suddenly showing a busy loop after a few
weeks of uptime), and require packagers to fix their flags. The vast
majority of distro packages are fine and will not be affected though.
When a default-server line specified a client certificate to use, the
frontend would not take it into account and create an empty SSL context,
which would raise an error on the backend side ("peer did not return a
certificate").
This bug was introduced by d817dc733e in
which the SSL contexts are created earlier than before (during the
default-server line parsing) without setting it in the corresponding
server structures. It then made the server create an empty SSL context
in ssl_sock_prepare_srv_ctx because it thought it needed one.
It was raised on redmine, in Bug #3906.
It can be backported to 2.4.
Since 1.9 with commit 673867c35 ("MAJOR: applets: Use tasks, instead
of rolling our own scheduler.") the thread_mask field of the appctx
became unused, but the code hadn't been cleaned for this. The appctx
has its own task and the task's thread_mask is the one to be displayed.
It's worth noting that all calls to appctx_new() pass tid_bit as the
thread_mask. This makes sense, and it could be convenient to decide
that this becomes the norm and to simplify the API.
Define a new global config statement named
"h2-workaround-bogus-websocket-clients".
This statement will disable the automatic announce of h2 websocket
support as specified in the RFC8441. This can be use to overcome clients
which fail to implement the relatively fresh RFC8441. Clients will in
his case automatically downgrade to http/1.1 for the websocket tunnel
if the haproxy configuration allows it.
This feature is relatively simple and can be backported up to 2.4, which
saw the introduction of h2 websocket support.
Fix the wrong usage of http_uri_parser which is defined with an
uninitialized uri. This causes a crash which happens when forwarding a
request to a backend configured in plain proxy ('option http_proxy').
This has been reported through a clang warning on the CI.
This bug has been introduced by the refactoring of URI parser API.
c453f9547e
MINOR: http: use http uri parser for path
This does not need to be backported.
WARNING: although this patch fix the crash, the 'option http_proxy'
seems to be non buggy, possibly since quite a few stable versions.
Indeed, the URI rewriting is not functional : the path is written on the
beginning of the URI but the rest of the URI is not and this garbage is
passed to the server which does not understand the request.
Replace http_get_path by the http_uri_parser API. The new functions is
renamed http_parse_path. Replace duplicated code for scheme and
authority parsing by invocations to http_parse_scheme/authority.
If no scheme is found for an URI detected as an absolute-uri/authority,
consider it to be an authority format : no path will be found. For an
absolute-uri or absolute-path, use the remaining of the string as the
path. A new http_uri_parser state is declared to mark the path parsing
as done.
Split in two the condition which check if the monitor-uri is set for the
current request. This will allow to easily use the http_uri_parser type
for http_get_path.
Replace http_get_authority by the http_uri_parser API.
The new function is renamed http_parse_authority. Replace duplicated
scheme parsing code by http_parse_scheme invocation. A new
http_uri_parser state is declared to mark the authority parsing as done.
Replace http_get_scheme by the http_uri_parser API. The new function is
renamed http_parse_scheme. A new http_uri_parser state is declared to
mark the scheme parsing as completed.
Apply the rfc 3986 scheme-based normalization on h2 requests. This
process will be executed for most of requests because scheme and
authority are present on every h2 requests, except CONNECT. However, the
normalization will only be applied on requests with defaults http port
(http/80 or https/443) explicitly specified which most http clients
avoid.
This change is notably useful for http2 websockets with Firefox which
explicitly specify the 443 default port on Extended CONNECT. In this
case, users can be trapped if they are using host routing without
removing the port. With the scheme-based normalization, the default port
will be removed.
To backport this change, it is required to backport first the following
commits:
* MINOR: http: implement http_get_scheme
* MEDIUM: http: implement scheme-based normalization
Apply the rfc 3986 scheme-based normalization on h1 requests. It is
executed only for requests which uses absolute-form target URI, which is
not the standard case.
Implement the scheme-based uri normalization as described in rfc3986
6.3.2. Its purpose is to remove the port of an uri if the default one is
used according to the uri scheme : 80/http and 443/https. All other
ports are not touched.
This method uses an htx message as an input. It requires that the target
URI is in absolute-form with a http/https scheme. This represents most
of h2 requests except CONNECT. On the contrary, most of h1 requests
won't be elligible as origin-form is the standard case.
The normalization is first applied on the target URL of the start line.
Then, it is conducted on every Host headers present, assuming that they
are equivalent to the target URL.
This change will be notably useful to not confuse users who are
accustomed to use the host for routing without specifying default ports.
This problem was recently encountered with Firefox which specify the 443
default port for http2 websocket Extended CONNECT.
gcc 8.3.0 spews a bunch of:
src/stick_table.c: In function 'action_inc_gpc0':
include/haproxy/freq_ctr.h:66:12: warning: 'period' may be used uninitialized in this function [-Wmaybe-uninitialized]
curr_tick += period;
^~
src/stick_table.c:2241:15: note: 'period' was declared here
unsigned int period;
^~~~~~
but they're incorrect because all accesses are guarded by the exact same
condition (ptr1 not being null), it's just the compiler being overzealous
about the uninitialized detection that seems to be stronger than its
ability to follow its own optimizations. This code path is not critical,
let's just pre-initialize the period to zero.
No backport is needed.
After reloading HAProxy, the old process may still hold active sessions.
Currently there is no way to gather information, how many sessions such
a process still holds. This patch will not exclude disabled proxies from
stats output when they hold at least one active session. This will allow
sending `!@<PID> show stat` through a master socket to the disabled
process and have it returning its stats data.
This reverts commit 19bbbe0562.
For now, set-src/set-src-port actions are directly performed on the client
connection. Using these actions at the stream level is really a problem with
HTTP connection (See #90) because all requests are affected by this change
and not only the current request. And it is worse with the H2, because
several requests can set their source address into the same connection at
the same time.
It is already an issue when these actions are called from "http-request"
rules. It is safer to wait a bit before adding the support to "tcp-request
content" rules. The solution is to be able to set src/dst address on the
stream and not on the connection when the action if performed from the L7
level..
Reverting the above commit means the issue #1303 is no longer fixed.
This patch must be backported in all branches containing the above commit
(as far as 2.0 for now).
A server name was displayed as <srv>/<proxy> instead of the reverse.
It only confuses diagnostics. This was introduced by commit 7a4a0ac71
("MINOR: cli: add a new "show fd" command") so this fix can be backport
down to 1.8.
As shown in issue #1251, it is possible for a connect() to report an
error directly via the poller without ever reporting send readiness,
but currentlt sock_conn_check() manages to ignore that situation,
leading to high CPU usage as poll() wakes up on these FDs.
The bug was apparently introduced in 1.5-dev22 with commit fd803bb4d
("MEDIUM: connection: add check for readiness in I/O handlers"), but
was likely only woken up by recent changes to conn_fd_handler() that
made use of wakeups instead of direct calls between 1.8 and 1.9,
voiding any chance to catch such errors in the early recv() callback.
The exact sequence that leads to this situation remains obscure though
because the poller does not report send readiness nor does it report an
error. Only HUP and IN are reported on the FD. It is also possible that
some recent kernel updates made this condition appear while it never
used to previously.
This needs to be backported to all stable branches, at least as far
as 2.0. Before 2.2 the code was in tcp_connect_probe() in proto_tcp.c.
This patch makes the use of 'gpc' excluding the use of the legacy
types 'gpc0' and 'gpc1" on the same table.
It also makes the use of 'gpc_rate' excluding the use of the legacy
types 'gpc0_rate' and 'gpc1_rate" on the same table.
The 'gpc0' and 'gpc1' related fetches and actions will apply
to the first two elements of the 'gpc' array if stored in table.
The 'gpc0_rate' and 'gpc1_rate' related fetches and actions will apply
to the first two elements of the 'gpc_rate' array if stored in table.
This patch adds the definition of two new array data_types:
'gpc': This is an array of 32bits General Purpose Counters.
'gpc_rate': This is an array on increment rates of General Purpose Counters.
Like for all arrays, they are limited to 100 elements.
This patch also adds actions and fetches to handle
elements of those arrays.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpc0', 'gpc1', 'gpc0_rate' nor 'gpc1_rate'.
This patch makes the use of 'gpt' excluding the use of the legacy
type 'gpt0' on the same table.
It also makes the 'gpt0' related fetches and actions applying
to the first element of the 'gpt' array if stored in table.
This patch adds the definition of a new array data_type
'gpt'. This is an array of 32bits General Purpose Tags.
Like for all arrays, it is limited to 100 elements.
This patch also adds actions and fetches to handle
elements of this array.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpt0' data type.
This patch adds support of array data_types on the peer protocol.
The table definition message will provide an additionnal parameter
for array data-types: the number of elements of the array.
In case of array of frqp it also provides a second parameter:
the period used to compute freq counter.
The array elements are std_type values linearly encoded in
the update message.
Note: if a remote peer announces an array data_type without
parameters into the table definition message, all updates
on this table will be ignored because we can not
parse update messages consistently.
This patch provides the code to handle arrays of some
standard types (SINT, UINT, ULL and FRQP) in stick table.
This way we could define new "array" data types.
Note: the number of elements of an array was limited
to 100 to put a limit and to ensure that an encoded
update message will continue to fit into a buffer
when the peer protocol will handle such data types.
This patch replaces all advanced data type aliases on
stktable_data_cast calls by standard types.
This way we could call the same stktable_data_cast
regardless of the used advanced data type as long they
are using the same std type.
It also removes all the advanced data type aliases.
This patch fixes the computation of the bit of the current data_type
in some part of code of peer protocol where the computation is limited
to 32bits whereas the bitfield of data_types can support 64bits.
Without this patch it could result in bugs when we will define more
than 32 data_types.
Backport is useless because there is currently less than 32 data_types
This patch fixes several errors printing integers
of stick table entry values and args during dump on cli.
This patch should be backported since the dump of entries
is supported. [wt: roughly 1.5-dev1 hence all stable branches]
The commit 3406766d5 ("MEDIUM: resolvers: add a ref between servers and srv
request or used SRV record") introduced a regression. The first server of a
template based on SRV record is no longer resolved. The same bug exists for
a normal server based on a SRV record.
In fact, the server used during parsing (used as reference when a
server-template line is parsed) is never attached to the corresponding srvrq
object. Thus with following lines, no resolution is performed because
"srvrq->attached_servers" is empty:
server-template test 1 _http.domain.tld resolvers dns ...
server test1 _http.domain.tld resolvers dns ...
This patch should fix the issue #1295 (but not confirmed yet it is the same
bug). It must be backported everywhere the above commit is.
As specified by the MQTT specification (MQTT-3.1.3-6), the client ID may be
empty. That means the length of the client ID string may be 0. However, The
MQTT parser does not support empty strings.
So, to fix the bug, the mqtt_read_string() function may now parse empty
string. 2 bytes must be found to decode the string length, but the length
may be 0 now. It is the caller responsibility to test the string emptiness
if necessary. In addition, in mqtt_parse_connect(), the client ID may be
empty now.
This patch should partely fix the issue #1310. It must be backported to 2.4.
Parsing of too long strings (> 127 characters) was buggy because of a wrong
cast on the length bytes. To fix the bug, we rely on mqtt_read_2byte_int()
function. This way, the string length is properly decoded.
This patch should partely fix the issue #1310. It must be backported to 2.4.
Since recent commit 469c06c30 ("MINOR: http-act/tcp-act: Add "set-mark"
and "set-tos" for tcp content rules") there's a build warning (or error)
on Windows due to static function tcp_action_set_mark() not being used
because the set-mark functionality is not supported there. It's caused
by the fact that only the parsing function uses it so if the code is
ifdefed out the function remains unused.
Let's surround it with ifdefs as well, and do the same for
tcp_action_set_tos() which could suffer the same fate on operating systems
not defining IP_TOS.
This may need to be backported if the patch above is backported. Also
be careful, the condition was adjusted to cover FreeBSD after commit
f7f53afcf ("BUILD/MEDIUM: tcp: set-mark setting support for FreeBSD.").
It is now possible to set the Netfilter MARK and the TOS field value in all
packets sent to the client from any tcp-request rulesets or the "tcp-response
content" one. To do so, the parsing of "set-mark" and "set-tos" actions are
moved in tcp_act.c and the actions evaluation is handled in dedicated functions.
This patch may be backported as far as 2.2 if necessary.
It is now possible to set the "nice" factor of the current stream from a
"tcp-request content" or "tcp-response content" ruleset. To do so, the
action parsing is moved in stream.c and the action evaluation is handled in
a dedicated function.
This patch may be backported as far as 2.2 if necessary.
It is now possible to set the stream log level from a "tcp-request content"
or "tcp-response content" ruleset. To do so, the action parsing is moved in
stream.c and the action evaluation is handled in a dedicated function.
This patch should fix issue #1306. It may be backported as far as 2.2 if
necessary.
The index of the failing rule is reported in the health-check log message. The
rules index is also used in the check traces. But for implicit HTTP send/expect
rules, the index is wrong. It must be incremented by one compared to the
preceding rule.
This patch may be backported as far as 2.2.
In srv_parse_agent_check the error code is not returned in case
something goes wrong. The value 0 is always return.
Additionally, there's a small cleanup of unreachable returns that in
most checks are not present either and removed in two places they were
present. This makes the code consistent across the different checks.
If resolv_get_ip_from_response() returns an error (or an unexpected return
value), the server is set to RMAINT status. However, its address must also
be reset. Otherwise, it is still reported by the cli on "show servers state"
commands. This may be confusing. Note that it is a theorical patch because
this code path does not exist. Thus it is not tagged as a BUG.
This patch may be backported as far as 2.0.
For A/AAAA resolution, if no ip is found for a server in the response, the
server is set to RMAINT status. However, its address must also be
reset. Otherwise, it is still reported by the cli on "show servers state"
commands. This may be confusing.
This patch may be backported as far as 2.0.
On A/AAAA resolution, for a given server, if a record is matching, we must
always attach the server to this record. Before it was only done if the
server IP was not the same than the record one. However, it is a problem if
the server IP was not set for a previous resolution. From the libc during
startup for instance. In this case, the server IP is not updated and the
server is not attached to any record. It remains in this state while a
matching record is found in the DNS response. It is especially a problem
when the resolution is used for server-templates.
This bug was introduced by the commit bd78c912f ("MEDIUM: resolvers: add a
ref on server to the used A/AAAA answer item").
This patch should solve the issue #1305. It must be backported to all
versions containing the above commit.
A dedicated queue lock was added by commit 16fbdda3c ("MEDIUM: queue:
use a dedicated lock for the queues (v2)") but during its rebase, some
labels were lost and left to SERVER_LOCK / PROXY_LOCK instead of
QUEUE_LOCK. It's harmless but can confuse the lock debugger, so better
fix it.
No backport is needed.
Commit ae0b12ee0 ("MEDIUM: queue: use a trylock on the server's queue")
introduced a hard to trigger bug that's more visible with a single thread:
if a server dequeues a connection and finds another free slot with no
connection to place there, process_srv_queue() will never break out of
the loop. In multi-thread it almost does not happen because other threads
bring new connections.
No backport is needed as it's only in -dev.
Since the code paths became exactly the same except for what log field
to update, let's simplify the code and move further code out of the
lock. The queue position update and the test for server vs proxy do not
need to be inside the lock.
Now we directly use p->queue to get to the queue, which is much more
straightforward. The performance on 100 servers and 16 threads
increased from 560k to 574k RPS, or 2.5%.
A lot more simplifications are possible, but the minimum was done at
this point.
A queue is specific to a server or a proxy, so we don't need to place
this distinction inside all pendconns, it can be in the queue itself.
This commit adds the relevant fields "px" and "sv" into the struct
queue, and initializes them accordingly.
Doing so makes sure that threads attempting to wake up new connections
for a server will give up early if another thread is already in charge
of this. The goal is to avoid unneeded contention on low server counts.
Now with a single server with 16 threads in roundrobin we get the same
performance as with multiple servers, i.e. ~575kreq/s instead of ~496k
before. Leastconn is seeing a similar jump, from ~460 to ~560k (the
difference being the calls to fwlc_srv_reposition).
The overhead of process_srv_queue() is now around 2% instead of ~20%
previously.
There's no point keeping the proxy lock held for a long time, it's
only needed when checking the proxy's queue, and keeping it prevents
multiple servers from dequeuing in parallel. Let's move it into
pendconn_process_next_strm() and release it ASAP. The pendconn
remains under the server queue lock's protection, guaranteeing that
no stream will release it while it's being touched.
For roundrobin, the performance increases by 76% (327k to 575k) on
16 threads. Even with a single server and maxconn=100, the performance
increases from 398 to 496 kreq/s. For leastconn, almost no change is
visible (less than one percent) but this is expected since most of the
time there is spent in fwlc_reposition() and fwlc_get_next_server().
Doing so allows to retrieve and update the pendconn's queue index outside
of the queue's lock and to save one more percent CPU on a highly-contented
backend.
The code only differed by the nbpend_max counter. Let's have a pointer
to it and merge the two variants to always use a generic queue. It was
initially considered to put the max inside the queue structure itself,
but the stats support clearing values and maxes and this would have been
the only counter having to be handled separately there. Given that we
don't need this max anywhere outside stats, let's keep it where it is
and have a pointer to it instead.
The CAS loop to update the max remains. It was naively thought that it
would have been faster without atomic ops inside the lock, but this is
not the case for the simple reason that it is a max, it converges very
quickly and never has to perform the check anymore. Thus this code is
better out of the lock.
The queue_idx is still updated inside the lock since that's where the
idx is updated, though it could be performed using atomic ops given
that it's only used to roughly count places for logging.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 362k
to 374k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 9%.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
There's no point doing atomic incs over px->served/px->totpend under the
locks from the inner loop, as this value is used by the LB algorithms but
not during the dequeuing step. In addition, the LB algo's take_conn()
doesn't need to be refreshed for each and every connection taken
under the lock, it can be performed once at the end and out of the
lock.
While the gain on roundrobin is not noticeable (only the atomic inc),
on leastconn which uses take_conn(), the performance increases from
355k to 362k req/s on 16 threads.
This reverts commit 5304669e1b.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 3e92a31783.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 1b648c857b.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit fcb8bf8650.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit c83e45e9b0.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 3eecdb65c5.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 1335eb9867.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit de814dd422.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 9a6d0ddbd6.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 5b39275311.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit 772e968b06.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
If it possible to set source IP/Port from "tcp-request connection",
"tcp-request session" and "http-request" rules but not from "tcp-request
content" rules. There is no reason for this limitation and it may be a
problem for anyone wanting to call a lua fetch to dynamically set source
IP/Port from a TCP proxy. Indeed, to call a lua fetch, we must have a
stream. And there is no stream when "tcp-request connection/session" rules
are evaluated.
Thanks to this patch, "set-src" and "set-src-port" action are now supported
by "tcp_request content" rules.
This patch is related to the issue #1303. It may be backported to all stable
versions.
In 1.4, consistent hashing was brought by commit 6b2e11be1 ("[MEDIUM]
backend: implement consistent hashing variation") which took care of
replacing all direct calls to map_get_server_rr() with an alternate
call to chash_get_next_server() if consistent hash was being used.
One of them, however, cannot happen because a preliminary test for
static round-robin is being done prior to the call, so we're certain
that if it matches it cannot use a consistent hash tree.
Let's remove it.
Dealing with the queue lock in the caller remains complicated. Let's
change pendconn_first() to take the queue instead of the tree head,
and handle the lock itself. It now returns an element with a locked
queue or no element with an unlocked queue. It can avoid locking if
the queue is already empty.
There's no point keeping the server's queue lock after seeing that the
server's queue is empty, just like there's no need to keep the proxy's
lock when its queue is empty. This patch checks for emptiness and
releases these locks as soon as possible.
With this the performance increased from 524k to 530k on 16 threads
with round-robin.
By placing the lock there, it becomes possible to lock the proxy
later and to unlock it earlier. The server unlocking also happens slightly
earlier.
The performance on roundrobin increases from 481k to 524k req/s on 16
threads. Leastconn shows about 513k req/s (the difference being the
take_conn() call).
The performance profile changes from this:
9.32% hap-pxok [.] process_srv_queue
7.56% hap-pxok [.] pendconn_dequeue
6.90% hap-pxok [.] pendconn_add
to this:
7.42% haproxy [.] process_srv_queue
5.61% haproxy [.] pendconn_dequeue
4.95% haproxy [.] pendconn_add
By doing so we can move some evaluations outside of the lock and the
loop. In the round robin case, the performance increases from 497k to
505k rps on 16 threads with 100 servers.
Doing so allows to retrieve and update the pendconn's queue index outside
of the queue's lock and to save one more percent CPU on a highly-contented
backend.
The code only differed by the nbpend_max counter. Let's have a pointer
to it and merge the two variants to always use a generic queue. It was
initially considered to put the max inside the queue structure itself,
but the stats support clearing values and maxes and this would have been
the only counter having to be handled separately there. Given that we
don't need this max anywhere outside stats, let's keep it where it is
and have a pointer to it instead.
The CAS loop to update the max remains. It was naively thought that it
would have been faster without atomic ops inside the lock, but this is
not the case for the simple reason that it is a max, it converges very
quickly and never has to perform the check anymore. Thus this code is
better out of the lock.
The queue_idx is still updated inside the lock since that's where the
idx is updated, though it could be performed using atomic ops given
that it's only used to roughly count places for logging.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 491k
to 507k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 6%.
The performance profile changes from this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
to this:
11.95% haproxy [.] fwlc_srv_reposition
7.57% haproxy [.] fwlc_get_next_server
3.51% haproxy [.] process_srv_queue
1.74% haproxy [.] pendconn_dequeue
1.70% haproxy [.] pendconn_add
At this point the differences are mostly measurement noise.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
This essentially reverts commit 2b4370078 ("MINOR: lb/api: let callers
of take_conn/drop_conn tell if they have the lock") that was merged
during 2.4 before the various locks could be eliminated at the lower
layers. Passing that information complicates the cleanup of the queuing
code and it's become useless.
The lock in process_srv_queue() was placed around the whole loop to
avoid the cost of taking/releasing it multiple times. But in practice
almost all calls to this function only dequeue a single connection, so
that argument doesn't really stand. However by placing the lock inside
the loop, we'd make it possible to release it before manipulating the
pendconn and waking the task up. That's what this patch does.
This increases the performance from 431k to 491k req/s on 16 threads
with 20 servers under leastconn.
The performance profile changes from this:
14.09% haproxy [.] process_srv_queue
10.22% haproxy [.] fwlc_srv_reposition
6.39% haproxy [.] fwlc_get_next_server
3.97% haproxy [.] pendconn_dequeue
3.84% haproxy [.] pendconn_add
to this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
The difference is even slightly more visible in roundrobin which
does not have take_conn() call.
It used to do far too much under the lock, including waking up tasks,
updating counters and repositionning entries in the load balancing algo.
This patch first moves all that stuff out of the function into the only
caller (process_srv_queue()). The decision to update the LB algo is now
taken out of the lock. The wakeups could be performed outside of the
loop by using a local list.
This increases the performance from 377k to 431k req/s on 16 threads
with 20 servers under leastconn.
The perf profile changes from this:
23.17% haproxy [.] process_srv_queue
6.58% haproxy [.] pendconn_add
6.40% haproxy [.] pendconn_dequeue
5.48% haproxy [.] fwlc_srv_reposition
3.70% haproxy [.] fwlc_get_next_server
to this:
13.95% haproxy [.] process_srv_queue
9.96% haproxy [.] fwlc_srv_reposition
6.21% haproxy [.] fwlc_get_next_server
3.96% haproxy [.] pendconn_dequeue
3.75% haproxy [.] pendconn_add
The server_parse_maxconn_change_request locks the server lock. However,
this function can be called via agent-checks or lua code which already
lock it. This bug has been introduced by the following commit :
commit 79a88ba3d0
BUG/MAJOR: server: prevent deadlock when using 'set maxconn server'
This commit tried to fix another deadlock with can occur because
previoulsy server_parse_maxconn_change_request requires the server lock
to be held. However, it may call internally process_srv_queue which also
locks the server lock. The locking policy has thus been updated. The fix
is functional for the CLI 'set maxconn' but fails to address the
agent-check / lua counterparts.
This new issue is fixed in two steps :
- changes from the above commit have been reverted. This means that
server_parse_maxconn_change_request must again be called with the
server lock.
- to counter the deadlock fixed by the above commit, process_srv_queue
now takes an argument to render the server locking optional if the
caller already held it. This is only used by
server_parse_maxconn_change_request.
The above commit was subject to backport up to 1.8. Thus this commit
must be backported in every release where it is already present.
Since commit c7eedf7a5 ("MINOR: queue: reduce the locked area in
pendconn_add()") the stream's pend_pos is set out of the lock, after
the pendconn is queued. While this entry is only manipulated by the
stream itself and there is no bug caused by this right now, it's a
bit dangerous because another thread could decide to look at this
field during dequeuing and could randomly see something else. Also
in case of crashes, memory inspection wouldn't be as trustable.
Let's assign the pendconn before it can be found in the queue.
Activate the 'ssl' keyword for dynamic servers. This is the final step
to have ssl dynamic servers feature implemented. If activated,
ssl_sock_prepare_srv_ctx will be called at the end of the 'add server'
CLI handler.
At the same time, update the management doc to list all ssl keywords
implemented for dynamic servers.
These keywords are deemed safe-enough to be enable on dynamic servers.
Their parsing functions are simple and can be called at runtime.
- allow-0rtt
- alpn
- ciphers
- ciphersuites
- force-sslv3/tlsv10/tlsv11/tlsv12/tlsv13
- no-sslv3/tlsv10/tlsv11/tlsv12/tlsv13
- no-ssl-reuse
- no-tls-tickets
- npn
- send-proxy-v2-ssl
- send-proxy-v2-ssl-cn
- sni
- ssl-min-ver
- ssl-max-ver
- tls-tickets
- verify
- verifyhost
'no-ssl-reuse' and 'no-tls-tickets' are enabled to override the default
behavior.
'tls-tickets' is enable to override a possible 'no-tls-tickets' set via
the global option 'ssl-default-server-options'.
'force' and 'no' variants of tls method options are useful to override a
possible 'ssl-default-server-options'.
File-access through ssl_store_load_locations_file is deactivated if
srv_parse_crl is used at runtime for a dynamic server. The crl must
have already been loaded either in the config or through the 'ssl crl'
CLI commands.
File-access through ssl_store_load_locations_file is deactivated if
srv_parse_crt is used at runtime for a dynamic server. The cert must
have already been loaded either in the config or through the 'ssl cert'
CLI commands.
File-access through ssl_store_load_locations_file is deactivated if
srv_parse_ca_file is used at runtime for a dynamic server. The ca-file
must have already been loaded either in the config or through the 'ssl
ca-file' CLI commands.
This will be in preparation for support of ssl on dynamic servers. The
'alpn' keyword will be allowed for dynamic servers but not the
'check-alpn'.
The alpn parsing is extracted into a new function parse_alpn. Each
srv_parse_alpn and srv_parse_check_alpn called it.
The function ssl_sock_load_srv_cert will be used at runtime for dynamic
servers. If the cert is not loaded on ckch tree, we try to access it
from the file-system.
Now this access operation is rendered optional by a new function
argument. It is only allowed at parsing time, but will be disabled for
dynamic servers at runtime.