When the H1 connection is released, a connection shutdown is now performed.
If it was already performed when the stream was detached, this action has no
effect. But it is mandatory, when an idle H1C is released. Otherwise the
xprt and the socket shutdown is never perfmed. It is especially important
for SSL client connections, because it is the only way to perform a clean
SSL shutdown.
Without this patch, SSL_shutdown is never called, preventing, among other
things, the SSL session caching.
This patch depends on the commit "BUG/MINOR: mux-h1: Save shutdown mode if
the shutdown is delayed". It should be backported as far as 2.0.
The connection shutdown may be delayed if there are pending outgoing
data. The action is performed once data are fully sent. In this case the
mode (dirty/clean) was lost and a clean shutdown was always performed. Now,
the mode is saved to be sure to perform the connection shutdown using the
right mode. To do so, H1C_F_ST_SILENT_SHUT flag is introduced.
This patch should be backported as far as 2.0.
With this feature the lua implementation of the httpclient is now able
to stream a payload larger than an haproxy buffer.
The hlua_httpclient_send() function is now split into:
hlua_httpclient_send() which initiate the httpclient and parse the lua
parameters
hlua_httpclient_snd_yield() which will send the request and be called
again to stream the request if the body is larger than an haproxy buffer
hlua_httpclient_rcv_yield() which will receive the response and store it
in the lua buffer.
This patch add a way to handle HTTP requests streaming using a
callback.
The end of the data must be specified by using the "end" parameter in
httpclient_req_xfer().
The OpenSSL documentation (https://www.openssl.org/docs/man1.1.0/man3/HMAC.html)
specifies:
> It places the result in md (which must have space for the output of the hash
> function, which is no more than EVP_MAX_MD_SIZE bytes). If md is NULL, the
> digest is placed in a static array. The size of the output is placed in
> md_len, unless it is NULL. Note: passing a NULL value for md to use the
> static array is not thread safe.
`EVP_MAX_MD_SIZE` appears to be defined as `64`, so let's simply use a stack
buffer to avoid the whole memory management.
At a few places we were still using protocol_by_family() instead of
the richer protocol_lookup(). The former is limited as it enforces
SOCK_STREAM and a stream protocol at the control layer. At least with
protocol_lookup() we don't have this limitationn. The values were still
set for now but later we can imagine making them configurable on the
fly.
Instead of using sock_type and ctrl_type to select a protocol, let's
make use of the new protocol type. For now they always match so there
is no change. This is applied to address parsing and to socket retrieval
from older processes.
The protocol selection is currently performed based on the family,
control type and socket type. But this is often not enough, as both
only provide DGRAM or STREAM, leaving few variants. Protocols like
SCTP for example might be indistinguishable from TCP here. Same goes
for TCP extensions like MPTCP.
This commit introduces a new enum proto_type that is placed in each
and every protocol definition, that will usually more or less match
the sock_type, but being an enum, will support additional values.
The test on the sock_domain is a bit useless because the protocols are
registered at boot time, and the test silently fails and returns no
error. Use a BUG_ON() instead to make sure to catch such bugs in the
code if any.
When compiled without SSL support, a variable is reported as not used by
GCC.
src/log.c: In function ‘sess_build_logline’:
src/log.c:2056:36: error: unused variable ‘conn’ [-Werror=unused-variable]
2056 | struct connection *conn;
| ^~~~
This does not need to be backported.
Because source and destination address of the client connection are now
updated at the appropriated level (connection, session or stream), original
info about the client connection are preserved. src/src_port/src_is_local
and dst/dst_port/dst_is_local return current info about the client
connection. It is the info at the highest available level. Most of time, the
stream. Any tcp/http rules may alter this info.
To get original info, "fc_" prefix must be added. For instance
"fc_src". Here, only "tcp-request connection" rules may alter source and
destination address/port.
This patch was reverted because it was inconsitent to change connection
addresses at stream level. Especially in HTTP because all requests was
affected by this change and not only the current one. In HTTP/2, it was
worse. Several streams was able to change the connection addresses at the
same time.
It is no longer an issue, thanks to recent changes. With multi-level client
source and destination addresses, it is possible to limit the change to the
current request. Thus this patch can be reintroduced.
If it possible to set source IP/Port from "tcp-request connection",
"tcp-request session" and "http-request" rules but not from "tcp-request
content" rules. There is no reason for this limitation and it may be a
problem for anyone wanting to call a lua fetch to dynamically set source
IP/Port from a TCP proxy. Indeed, to call a lua fetch, we must have a
stream. And there is no stream when "tcp-request connection/session" rules
are evaluated.
Thanks to this patch, "set-src" and "set-src-port" action are now supported
by "tcp_request content" rules.
This patch is related to the issue #1303.
When client source or destination addresses are changed via a tcp/http
action, we update addresses at the appropriate level. When "tcp-request
connection" rules are evaluated, we update addresses at the connection
level. When "tcp-request session" rules is evaluated, we update those at the
session level. And finally, when "tcp-request content" or "http-request"
rules are evaluated, we update the addresses at the stream level.
The same is performed when source or destination ports are changed.
Of course, for now, not all level are supported. But thanks to this patch,
it will be possible.
Just like for the PROXY protocol, when the NetScaler Client IP insertion
header is received, the retrieved client source and destination addresses
are set at the session level. This leaves those at the connection level
intact.
When PROXY protocol line is received, the retrieved client source and
destination addresses are set at the session level. This leaves those at the
connection level intact.
Client source and destination addresses at stream level are used to initiate
the connections to a server. For now, stream-interface addresses are never
set. So, thanks to the fallback mechanism, no changes are expected with this
patch. But its purpose is to rely on addresses at the appropriate level when
set instead of those at the connection level.
If the stream exists, the frontend stream-interface is used to get the
client source and destination addresses when the proxy line is built. For
now, stream-interface or session addresses are never set. So, thanks to the
fallback mechanism, no changes are expected with this patch. But its purpose
is to rely on addresses at the appropriate level when set instead of those
at the connection level.
In src, src-port, dst and dst-port sample fetches, the client source and
destination addresses are retrieved from the appropriate level. It means
that, if the stream exits, we use the frontend stream-interface to get the
client source and destination addresses. Otherwise, the session is used. For
now, stream-interface or session addresses are never set. So, thanks to the
fallback mechanism, no changes are expected with this patch. But its purpose
is to rely on addresses at the appropriate level when set instead of those
at the connection level.
Client source and destination addresses at stream level are now used to emit
SERVER_NAME/SERVER_PORT and REMOTE_ADDR/REMOTE_PORT parameters. For now,
stream-interface addresses are never set. So, thanks to the fallback
mechanism, no changes are expected with this patch. But its purpose is to
rely on addresses at the stream level, when set, instead of those at the
connection level.
Client source and destination addresses at stream level are now used to
compute base32+src and url32+src hashes. For now, stream-interface addresses
are never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the stream
level, when set, instead of those at the connection level.
Client source and destination addresses at stream level are now used to emit
X-Forwarded-For and X-Original-To headers. For now, stream-interface addresses
are never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the stream level,
when set, instead of those at the connection level.
When an embryonic session is killed, if no log format is defined for this
error, a generic error is emitted. When this happens, we now rely on the
session to get the client source address. For now, session addresses are
never set. So, thanks to the fallback mechanism, no changes are expected
with this patch. But its purpose is to rely on addresses at the session
level when set instead of those at the connection level.
When a log message is emitted, if the stream exits, we use the frontend
stream-interface to retrieve the client source and destination
addresses. Otherwise, the session is used. For now, stream-interface or
session addresses are never set. So, thanks to the fallback mechanism, no
changes are expected with this patch. But its purpose is to rely on
addresses at the appropriate level when set instead of those at the
connection level.
For now, stream-interface or session addresses are never set. So, thanks to
the fallback mechanism, no changes are expected with this patch. But its
purpose is to rely on the client addresses at the stream level, when set,
instead of those at the connection level. The addresses are retrieved from
the frontend stream-interface.
For now, these addresses are never set. But the idea is to be able to set, at
least first, the client source and destination addresses at the stream level
without updating the session or connection ones.
Of course, because these addresses are carried by the strream-interface, it
would be possible to set server source and destination addresses at this level
too.
Functions to fill these addresses have been added: si_get_src() and
si_get_dst(). If not already set, these functions relies on underlying
layers to fill stream-interface addresses. On the frontend side, the session
addresses are used if set, otherwise the client connection ones are used. On
the backend side, the server connection addresses are used.
And just like for sessions and conncetions, si_src() and si_dst() may be used to
get source and destination addresses or the stream-interface. And, if not set,
same mechanism as above is used.
For now, these addresses are never set. But the idea is to be able to set
client source and destination addresses at the session level without
updating the connection ones.
Functions to fill these addresses have been added: sess_get_src() and
sess_get_dst(). If not already set, these functions relies on
conn_get_src() and conn_get_dst() to fill session addresses.
And just like for conncetions, sess_src() and sess_dst() may be used to get
source and destination addresses. However, if not set, the corresponding
address from the underlying client connection is returned. When this
happens, the addresses is filled in the connection object.
hlua_http_msg_get_body must return either a Lua string or nil. For some
HTTPMessage objects, HTX_BLK_EOT blocks are also present in the HTX buffer
along with HTX_BLK_DATA blocks. In such cases, _hlua_http_msg_dup will start
copying data into a luaL_Buffer until it encounters an HTX_BLK_EOT. But then
instead of pushing neither the luaL_Buffer nor `nil` to the Lua stack, the
function will return immediately. The end result will be that the caller of
the HTTPMessage.body() method from a Lua filter will see whatever object was
on top of the stack as return value. It may be either a userdata object if
HTTPMessage.body() was called with only two arguments, or the third argument
itself if called with three arguments. Hence HTTPMessage.body() would return
either nil, or HTTPMessage body as Lua string, or a userdata objects, or
number.
This fix ensure that HTTPMessage.body() will always return either a string
or nil.
Reviewed-by: Christopher Faulet <cfaulet@haproxy.com>
Add a check during the httpclient request generation which yield an lua
error when the generation didn't work. The most common case is the lack
of space in the buffer, it can because of too much headers or a too big
body.
Add support for HEAD/PUT/POST/DELETE method with the lua httpclient.
This patch use the httpclient_req_gen() function with a different meth
parameter to implement this.
Also change the reg-test to support a POST request with a body.
httpclient_req_gen() takes a payload argument which can be use to put a
payload in the request. This payload can only fit a request buffer.
This payload can also be specified by the "body" named parameter within
the lua. httpclient.
It is also used within the CLI httpclient when specified as a CLI
payload with "<<".
Remove the zeroing of an idle connection node on remove from a tree.
This is not needed and should improve slightly the performance of idle
connection usage. Besides, it breaks the memory poisoning feature.
Skip the hash connection calcul when reuse must not be used in
connect_server() : this is the case for TCP proxies. This should result
in slightly better performance when using this use-case.
In connect_server(), if http-reuse always is set, the backend connection
is inserted into the available tree as soon as created. However, the
hash connection field is only set later at the end of the function.
This seems to have no impact as the hash connection field is always
position before a lookup. However, this is not a proper usage of ebmb
API. Fix this by setting the hash connection field before the insertion
into the avail tree.
This must be backported up to 2.4.
Add traces in connect_server() to debug idle connection reuse. These
are attached to stream trace module, as it's already in use in
backend.c with the macro TRACE_SOURCE.
The current model causes an issue when trying to spot memory leaks,
because malloc(0) or realloc(0) do not count as allocations since we only
account for the application-usable size. This is the problem that made
issue #1406 not to appear as a leak.
What we're doing now is to account for one extra pointer (the one that
memory allocators usually place before the returned area), so that a
malloc(0) will properly account for 4 or 8 bytes. We don't need something
exact, we just need something non-zero so that a realloc(X) followed by a
realloc(0) without a free() gives a small non-zero result.
It was verified that the results are stable including in the presence
of lots of malloc/realloc/free as happens when stressing Lua.
It would make sense to backport this to 2.4 as it helps in bug reports.
realloc() calls are painful to analyse because they have two non-zero
columns and trying to spot a leaking one requires a bit of scripting.
Let's simply append the delta at the end of the line when alloc and
free are non-nul.
It would be useful to backport this to 2.4 to help with bug reports.
In issue #1406, Lev Petrushchak reported a nasty memory leak on Alpine
since haproxy 2.4 when using Lua, that memory profiling didn't detect.
After inspecting the code and Lua's code, it appeared that Lua's default
allocator does an explicit free() on size zero, while since 2.4 commit
d36c7fa5e ("MINOR: lua: simplify hlua_alloc() to only rely on realloc()"),
haproxy only calls realloc(ptr,0) that performs a free() on glibc but not
on other systems as it's not required by POSIX...
This patch reinstalls the explicit test for nsize==0 to call free().
Thanks to Lev for the very documented report, and to Tim for the links
to a musl thread on the same subject that confirms the diagnostic.
This must be backported to 2.4.
Some browsers may send Initial packets with sizes greater than 1252 bytes
(QUIC_INITIAL_IPV4_MTU). Let us increase this size limit up to 2048 bytes.
Also use this size for "max_udp_payload_size" transport parameter to limit
the size of the datagrams we want to receive.
In issue 1424 Coverity reports that the loop increment is unreachable,
which is true, the list_for_each_entry() was replaced with a for loop,
but it was already not needed and was instead used as a convenient
construct for a single iteration lookup. Let's get rid of all this
now and replace the loop with an "if" statement.
While in H1 we can usually close quickly, in H2 a client might be sending
window updates or anything while we're sending a GOAWAY and the pending
data in the socket buffers at the moment the close() is performed on the
socket results in the output data being lost and an RST being emitted.
One example where this happens easily is with h2spec, which randomly
reports connection resets when waiting for a GOAWAY while haproxy sends
it, as seen in issue #1422. With h2spec it's not window updates that are
causing this but the fact that h2spec has to upload the payload that
comes with invalid frames to accommodate various implementations, and
does that in two different segments. When haproxy aborts on the invalid
frame header, the payload was not yet received and causes an RST to
be sent.
Here we're dealing with this two ways:
- we perform a shutdown(WR) on the connection to forcefully push pending
data on a front connection after the xprt is shut and closed ;
- we drain pending data
- then we close
This totally solves the issue with h2spec, and the extra cost is very
low, especially if we consider that H2 connections are not set up and
torn down often. This issue was never observed with regular clients,
most likely because this pattern does not happen in regular traffic.
After more testing it could make sense to backport this, at least to
avoid reporting errors on h2spec tests.
Sometimes we'd like to do our best to drain pending data before closing
in order to save the peer from risking to receive an RST on close.
This adds a new connection flag CO_FL_WANT_DRAIN that is used to
trigger a call to conn_ctrl_drain() from conn_ctrl_close(), and the
sock_drain() function ignores fd_recv_ready() if this flag is set,
in order to catch latest data. It's not used for now.
Some checks were added by commit 9a3d3fcb5 ("BUG/MAJOR: mux-h2: Don't try
to send data if we know it is no longer possible") to make sure we don't
loop forever trying to send data that cannot leave. But one of the
conditions there is not correct, the one relying on H2_CS_ERROR2. Indeed,
this state indicates that the error code was serialized into the mux
buffer, and since the test is placed before trying to send the data to
the socket, if the connection states only contains a GOAWAY frame, it
may refrain from sending and may close without sending anything. It's
not dramatic, as GOAWAY reports connection errors in situations where
delivery is not even certain, but it's cleaner to make sure the error
is properly sent, and it avoids upsetting h2spec, as seen in github
issue #1422.
Given that the patch above was backported as far as 1.8, this patch will
also have to be backported that far.
Thanks to Ilya for reporting this one.
This applicationn specific flag was added in 2.4-dev by commit 6fa8bcdc7
("MINOR: task: add an application specific flag to the state: TASK_F_USR1")
to help preserve a the idle connections status across wakeup calls. While
the code to do this was OK for tasklets, it was wrong for tasks, as in an
effort not to lose it when setting the RUNNING flag (that tasklets don't
have), it ended up being inconditionally set. It just happens that for now
no regular tasks use it, only tasklets.
This fix makes sure we always atomically perform (state & flags | running)
there, using a CAS. It also does it for tasklets because it was possible
to lose some such flags if set by another thread, even though this should
not happen with current code. In order to make the code more readable (and
avoid the previous mistake of repeated flags in the bit field), a new
TASK_PERSISTENT aggregate was declared in task.h for this.
In practice the CAS is cheap here because task states are stable or
convergent so the loop will almost never be taken.
This should be backported to 2.4.
The crash that was fixed by commit 7045590d8 ("BUG/MAJOR: dns: attempt
to lock globaly for msg waiter list instead of use barrier") was now
completely analysed and confirmed to be partially a result of the
debugging code added to LIST_INLIST(), which was looking at both
pointers and their reciprocals, and that, if used in a concurrent
context, could perfectly return false if a neighbor was being added or
removed while the current one didn't change, allowing the LIST_APPEND
to fail.
As the LIST API was not designed to be used in a concurrent context,
we should not rely on LIST_INLIST() but on the newly introduced
LIST_INLIST_ATOMIC().
This patch simply reverts the commit above to switch to the new test,
saving a lock during potentially long operations. It was verified that
the check doesn't fail anymore.
It is unsure what the performance impact of the fix above could be in
some contexts. If any performance regression is observed, it could make
sense to backport this patch, along with the previous commit introducing
the LIST_INLIST_ATOMIC() macro.
We're using an XXH32() on the record to insert it into or look it up from
the tree. This way we don't change the rest of the code, the comparisons
are still made on all fields and the next node is visited on mismatch. This
also allows to continue to use roundrobin between identical nodes.
Just doing this is sufficient to see the CPU usage go down from ~60-70% to
4% at ~2k DNS requests per second for farm with 300 servers. A larger
config with 12 backends of 2000 servers each shows ~8-9% CPU for 6-10000
DNS requests per second.
It would probably be possible to go further with multiple levels of indexing
but it's not worth it, and it's important to remember that tree nodes take
space (the struct answer_list went back from 576 to 600 bytes).
With SRV records, a huge amount of time is spent looking for records
by walking long lists. It is possible to reduce this by indexing values
in trees instead. However the whole code relies a lot on the list
ordering, and even implements some round-robin on it to distribute IP
addresses to servers.
This patch starts carefully by replacing the list with a an eb32 tree
that is still used like a list, with a constant key 0. Since ebtrees
preserve insertion order for duplicates, the tree walk visits the nodes
in the exact same order it did with the lists. This allows to implement
the required infrastructure without changing the behavior.
When cleaning up the code to remove most explicit task masks in commit
beeabf531 ("MINOR: task: provide 3 task_new_* wrappers to simplify the
API"), a mistake was done with the external checks where the call does
task_new_on(1) instead of task_new_on(0) due to the confusion with the
previous mask 1.
No backport is needed as that's only 2.5-dev.
This one was used to indicate whether the callee had to follow particularly
safe code path when removing resolutions. Since the code now uses a kill
list, this is not needed anymore.
When scanning resolution.curr it's possible to try to free some
resolutions which will themselves result in freeing other ones. If
one of these other ones is exactly the next one in the list, the list
walk visits deleted nodes and causes memory corruption, double-frees
and so on. The approach taken using the "safe" argument to some
functions seems to work but it's extremely brittle as it is required
to carefully check all call paths from process_ressolvers() and pass
the argument to 1 there to refrain from deleting entries, so the bug
is very likely to come back after some tiny changes to this code.
A variant was tried, checking at various places that the current task
corresponds to process_resolvers() but this is also quite brittle even
though a bit less.
This patch uses another approach which consists in carefully unlinking
elements from the list and deferring their removal by placing it in a
kill list instead of deleting them synchronously. The real benefit here
is that the complexity only has to be placed where the complications
are.
A thread-local list is fed with elements to be deleted before scanning
the resolutions, and it's flushed at the end by picking the first one
until the list is empty. This way we never dereference the next element
and do not care about its presence or not in the list. One function,
resolv_unlink_resolution(), is exported and used outside, so it had to
be modified to use this list as well. Internal code has to use
_resolv_unlink_resolution() instead.
The code as it is uses crossed lists between many elements, and at
many places the code relies on list iterators or emptiness checks,
which does not work with only LIST_DELETE. Further, it is quite
difficult to place debugging code and checks in the current situation,
and gdb is helpless.
This code replaces all LIST_DELETE calls with LIST_DEL_INIT so that
it becomes possible to trust the lists.
This function allocates requesters by hand for each and every type. This
is complex and error-prone, and it doesn't even initialize the list part,
leaving dangling pointers that complicate debugging.
This patch introduces a new function resolv_get_requester() that either
returns the current pointer if valid or tries to allocate a new one and
links it to its destination. Then it makes use of it in the function
above to clean it up quite a bit. This allows to remove complicated but
unneeded tests.
Similar to the previous patch, the answer's list was only initialized the
first time it was added to a list, leading to bogus outdated pointer to
appear when debugging code is added around it to watch it. Let's make
sure it's always initialized upon allocation.
The query_list is physically stored in the struct resolution itself,
so we have a list that contains a list to items stored in itself (and
there is a single item). But the list is first initialized in
resolv_validate_dns_response(), while it's scanned in
resolv_process_responses() later after calling the former. First,
this results in crashes as soon as the code is instrumented a little
bit for debugging, as elements from a previous incarnation can appear.
But in addition to this, the presence of an element is checked by
verifying that the return of LIST_NEXT() is not NULL, while it may
never be NULL even for an empty list, resulting in bugs or crashes
if the number of responses does not match the list's contents. This
is easily triggered by testing for the list non-emptiness outside of
the function.
Let's make sure the list is always correct, i.e. it's initialized to
an empty list when the structure is allocated, elements are checked by
first verifying the list is not empty, they are deleted once checked,
and in any case at end so that there are no dangling pointers.
This should be backported, but only as long as the patch fits without
modifications, as adaptations can be risky there given that bugs tend
to hide each other.
Depending on the code that precedes the loop, gcc may emit this warning:
src/resolvers.c: In function 'resolv_process_responses':
src/resolvers.c:1009:11: warning: potential null pointer dereference [-Wnull-dereference]
1009 | if (query->type != DNS_RTYPE_SRV && flags & DNS_FLAG_TRUNCATED) {
| ~~~~~^~~~~~
However after carefully checking, r_res->header.qdcount it exclusively 1
when reaching this place, which forces the for() loop to enter for at
least one iteration, and <query> to be set. Thus there's no code path
leading to a null deref. It's possibly just because the assignment is
too far and the compiler cannot figure that the condition is always OK.
Let's just mark it to please the compiler.
This code is dangerous enough that we certainly don't want external code
to ever approach it, let's not export unnecessary functions like this one.
It was made static and a comment was added about its purpose.
There is a fundamental design bug in the resolvers code which is that
a list of active resolutions is being walked to try to delete outdated
entries, and that the code responsible for removing them also removes
other elements, including the next one which will be visited by the
list iterator. This randomly causes a use-after-free condition leading
to crashes, infinite loops and various other issues such as random memory
corruption.
A first fix for the memory fix for this was brought by commit 0efc0993e
("BUG/MEDIUM: resolvers: Don't release resolution from a requester
callbacks"). While preparing for more fixes, some code was factored by
commit 11c6c3965 ("MINOR: resolvers: Clean server in a dedicated function
when removing a SRV item"), which inadvertently passed "0" as the "safe"
argument all the time, missing one case of removal protection, instead
of always using "safe". This patch reintroduces the correct argument.
This must be backported with all fixes above.
Cc: Christopher Faulet <cfaulet@haproxy.com>
A few places have been caught triggering late bugs recently, always cases
of use-after-free because a freed element was still found in one of the
lists. This patch adds a few checks for such elements in dns_session_free()
before the final pool_free() and dns_session_io_handler() before adding
elements to lists to make sure they remain consistent. They do not trigger
anymore now.
When dns_session_release() calls dns_session_free(), it was shown that
it might still be attached there:
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
768 BUG_ON(!LIST_ISEMPTY(&ds->ring.waiters));
[Current thread is 1 (Thread 0x7f895bbe2700 (LWP 31792))]
(gdb) bt
#0 0x00000000006437d7 in dns_session_free (ds=0x7f895439e810) at src/dns.c:768
#1 0x0000000000643ab8 in dns_session_release (appctx=0x7f89545a4ff0) at src/dns.c:805
#2 0x000000000062e35a in si_applet_release (si=0x7f89545a5550) at include/haproxy/stream_interface.h:236
#3 0x000000000063150f in stream_int_shutw_applet (si=0x7f89545a5550) at src/stream_interface.c:1697
#4 0x0000000000640ab8 in si_shutw (si=0x7f89545a5550) at include/haproxy/stream_interface.h:437
#5 0x0000000000643103 in dns_session_io_handler (appctx=0x7f89545a4ff0) at src/dns.c:725
#6 0x00000000006d776f in task_run_applet (t=0x7f89545a5100, context=0x7f89545a4ff0, state=81924) at src/applet.c:90
#7 0x000000000068b82b in run_tasks_from_lists (budgets=0x7f895bbbf5c0) at src/task.c:611
#8 0x000000000068c258 in process_runnable_tasks () at src/task.c:850
#9 0x0000000000621e61 in run_poll_loop () at src/haproxy.c:2636
#10 0x0000000000622328 in run_thread_poll_loop (data=0x8d7440 <ha_thread_info+64>) at src/haproxy.c:2807
#11 0x00007f895c54a06b in start_thread () from /lib64/libpthread.so.0
#12 0x00007f895bf3772f in clone () from /lib64/libc.so.6
(gdb) p &ds->ring.waiters
$1 = (struct list *) 0x7f895439e8a8
(gdb) p ds->ring.waiters
$2 = {
n = 0x7f89545a5078,
p = 0x7f89545a5078
}
(gdb) p ds->ring.waiters->n
$3 = (struct list *) 0x7f89545a5078
(gdb) p *ds->ring.waiters->n
$4 = {
n = 0x7f895439e8a8,
p = 0x7f895439e8a8
}
Let's always detach it before freeing so that it remains possible to
check the dns_session's ring before releasing it, and possibly catch
bugs.
The barrier is insufficient here to protect the waiters list as we can
definitely catch situations where ds->waiter shows an inconsistency
whereby the element is not attached when entering the "if" block and
is already attached when attaching it later.
This patch uses a larger lock to maintain consistency. Without it the
code would crash in 30-180 minutes under heavy stress, always showing
the same problem (ds->waiter->n->p != &ds->waiter). Now it seems to
always resist, suggesting that this was indeed the problem.
This will have to be backported to 2.4.
Using tcp, after a session release and free, the session can remain
attached to the list of sessions with a response message waiting for
a commit (ds->waiter). This results to a use after free of this
session.
Also, on some error path and after free, a session could remain attached
to the lists of available idle/free sessions (ds->list).
This patch ensure to remove the session from those external lists
before a free.
This patch should be backported to all version including
the dns over tcp (2.4)
When an HTTP response is parsed, early parsing errors are not properly
handled. When this error is reported by the multiplexer, nothing is copied
into the input buffer. The HTX message remains empty but the
HTX_FL_PARSING_ERROR flag is set. In addition CS_FL_EOI is set on the
conn-stream. This last flag must be handled to prevent subscription for
receive events. Otherwise, in the best case, a L7 timeout error is
reported. But a transient loop is also possible if a shutdown is received
because the multiplexer notifies the check of the event while the check
never handles it and waits for more data.
Now, if CS_FL_EOI flag is set on the conn-stream, expect rules are
evaluated. Any error must be handled there.
Thanks to @kazeburo for his valuable report.
This patch should fix the issue #1420. It must be backported at least to
2.4. On 2.3 and 2.2, there is no loop but the wrong error is reported (empty
response instead of invalid one). Thus it may also be backported as far as
2.2.
If a channel error (READ_ERRO|READ_TIMEOUT|WRITE_ERROR|WRITE_TIMEOUT) is
detected by the stream, in process_stream(), FLT_END analyers must be
preserved. It is important to be sure to ends filter analysis and be able to
release the stream.
First, filters may release some ressources when FLT_END analyzers are
called. Then, the CF_FL_ANALYZE flag is used to sync end of analysis for the
request and the response. If FLT_END analyzer is ignored on a channel, this
may block the other side and freeze the stream.
This patch must be backported to all stable versions
Replace the test based on the enum value of the algorithm by an explicit
switch statement in case someone reorders it for some reason (while
still managing not to break the regtest).
resolv_hostname_cmp() is bogus, it is applied on labels and not plain
names, but doesn't make any distinction between length prefixes and
characters, so it compares the labels lengths via tolower() as well.
The only reason for which it doesn't break is because labels cannot
be larger than 63 bytes, and that none of the common encoding systems
have upper case letters in the lower 63 bytes, that could be turned
into a different value via tolower().
Now that all labels are stored in lower case, we don't need to burn
CPU cycles in tolower() at run time and can use memcmp() instead of
resolv_hostname_cmp(). This results in a ~22% lower CPU usage on large
farms using SRV records:
before:
18.33% haproxy [.] resolv_validate_dns_response
10.58% haproxy [.] process_resolvers
10.28% haproxy [.] resolv_hostname_cmp
7.50% libc-2.30.so [.] tolower
46.69% total
after:
24.73% haproxy [.] resolv_validate_dns_response
7.78% libc-2.30.so [.] __memcmp_avx2_movbe
3.65% haproxy [.] process_resolvers
36.16% total
The whole code relies on performing case-insensitive comparison on
lookups, which is extremely inefficient.
Let's make sure that all labels to be looked up or sent are first
converted to lower case. Doing so is also the opportunity to eliminate
an inefficient memcpy() in resolv_dn_label_to_str() that essentially
runs over a few unaligned bytes at once. As a side note, that call
was dangerous because it relied on a sign-extended size taken from a
string that had to be sanitized first.
This is tagged medium because while this is 100% safe, it may cause
visible changes on the wire at the packet level and trigger bugs in
test programs.
Coverity reported in issue #1416 that label oom3 is not reachable in
function close_listener() added by commit 59a877dfd ("MINOR: listeners:
add clone_listener() to duplicate listeners at boot time"). The code
leading to it was removed during the development of the function, but
not the label itself.
Coverity noticed in issue #1416 that a missing allocation error was
introduced in tcp_bind_listener() with the rework of error messages by
commit ed1748553 ("MINOR: proto_tcp: use chunk_appendf() to ouput socket
setup errors"). In practice nobody will ever face it but better address
it anyway.
No backport is needed.
When the clone_listener() function was added in commit 59a877dfd
("MINOR: listeners: add clone_listener() to duplicate listeners at
boot time"), a stupid bug was introduced when splitting the error
path because while the first case where calloc fails will leave NULL
in the output value, the other cases will return the pointer to a
freed area. This was reported by Coverity in issue #1416.
In practice nobody will face it (out-of-memory while checking config),
but let's fix it.
No backport is needed.
Commit 7a06ffb85 ("BUG/MEDIUM: sample: Cumulate frontend and backend
sample validity flags") introduced a typo confusing the request and
the response direction when checking for validity of a rule applied
to a backend. This was reported by Coverity in issue #1417.
This needs to be backported where the patch above is backported.
Fix the macro used to retrieve the max number of cpus on FreeBSD. The
MAXCPU is not properly defined in userspace and always set to 1 despite
the machine architecture. Replace it with CPU_SETSIZE.
See https://freebsd-hackers.freebsd.narkive.com/gw4BeLum/smp-in-machine-params-h#post6
Without this, the following config file is rejected on FreeBSD even if
the machine is SMP :
global
cpu-map 1-2 0-1
This must be backported up to 2.4.
It is now possible to have TCP/HTTP rules and ACLs defined in defaults
sections. So we must try to release corresponding lists when a default proxy
is destroyed.
No backport needed.
When the sample validity flags are computed to check if a sample is used in
a valid scope, the flags depending on the proxy capabilities must be
cumulated. Historically, for a sample on the request, only the frontend
capability was used to set the sample validity flags while for a sample on
the response only the backend was used. But it is a problem for listen or
defaults proxies. For those proxies, all frontend and backend samples should
be valid. However, at many place, only frontend ones are possible.
For instance, it is impossible to set the backend name (be_name) into a
variable from a listen proxy.
This bug exists on all stable versions. Thus this patch should probably be
backported. But with some caution because the code has probably changed
serveral times. Note that nobody has ever noticed this issue. So the need to
backport this patch must be evaluated for each branch.
As for TCP rules, HTTP rules from defaults section are now evaluated. These
rules are evaluated before those of the proxy. The same default ruleset
cannot be attached to the frontend and the backend. However, at this stage,
we take care to not execute twice the same ruleset. So, in theory, a
frontend and a backend could use the same defaults section. In this case,
the default ruleset is executed before all others and only once.
TCP rules from defaults section are now evaluated. These rules are evaluated
before those of the proxy. For L7 TCP rules, the same default ruleset cannot
be attached to the frontend and the backend. However, at this stage, we take
care to not execute twice the same ruleset. So, in theory, a frontend and a
backend could use the same defaults section. In this case, the default
ruleset is executed before all others and only once.
TCP and HTTP rules can now be defined in defaults sections, but only those
with a name. Because these rules may use conditions based on ACLs, ACLs can
also be defined in defaults sections.
However there are some limitations:
* A defaults section defining TCP/HTTP rules cannot be used by a defaults
section
* A defaults section defining TCP/HTTP rules cannot be used bu a listen
section
* A defaults sections defining TCP/HTTP rules cannot be used by frontends
and backends at the same time
* A defaults sections defining 'tcp-request connection' or 'tcp-request
session' rules cannot be used by backends
* A defaults sections defining 'tcp-response content' rules cannot be used
by frontends
The TCP request/response inspect-delay of a proxy is now inherited from the
defaults section it uses. For now, these rules are only parsed. No evaluation is
performed.
With the commit eaba25dd9 ("BUG/MINOR: tcpcheck: Don't use arg list for
default proxies during parsing"), we restricted the use of sample fetch in
tcpcheck rules defined in a defaults section to those depending on explicit
arguments only. This means a tcpcheck rules defined in a defaults section
cannot rely on argument unresolved during the configuration parsing.
Thanks to recent changes, it is now possible again.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
When the parsing of a defaults section is started, the previous anonymous
defaults section is removed. It may be a problem with referenced defaults
sections. And because all unused defautl proxies are removed after the
configuration parsing, it is not required to remove it so early.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
If a not-ready default proxy is referenced by a proxy during the
configuration validity check, its configuration is also finished and
PR_FL_READY flag is set on it.
For now, the arguments resolution is the only step performed.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
The PR_FL_READY flags must now be set on a proxy at the end of the
configuration validity check to notify it is fully configured and may be
safely used.
For now there is no real usage of this flag. But it will be usefull for
referenced default proxies to finish their configuration only once.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
A proxy may now references the defaults section it is used. To do so, a
pointer on the default proxy was added in the proxy structure. And a
refcount must be used to track proxies using a default proxy. A default
proxy is destroyed iff its refcount is equal to zero and when it drops to
zero.
All this stuff must be performed during init/deinit staged for now. All
unreferenced default proxies are removed after the configuration parsing.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
It is now possible to designate the defaults section to use by adding a name
of the corresponding defaults section and referencing it in the desired
proxy section. However, this introduces an ambiguity. This named defaults
section may still be implicitly used by other proxies if it is the last one
defined. In this case for instance:
default common
...
default frt from common
...
default bck from common
...
frontend fe from frt
...
backend be from bck
...
listen stats
...
Here, it is not really obvious the last section will use the 'bck' defaults
section. And it is probably not the expected behaviour. To help users to
properly configure their haproxy, a warning is now emitted if a defaults
section is explicitly AND implicitly used. The configuration manual was
updated accordingly.
Because this patch adds a warning, it should probably not be backported to
2.4. However, if is is backported, it depends on commit "MINOR: proxy:
Introduce proxy flags to replace disabled bitfield".
It is not yet used but thanks to this patch, it will be possible to resolve
arguments found in defaults sections. However, there is some restrictions:
* For FE (frontend) or BE (backend) arguments, if the proxy is explicity
defined, there is no change. But for implicit proxy (not specified), the
argument points on the default proxy. when a sample fetch using this
kind of argument is evaluated, the default proxy replaced by the current
one.
* For SRV (server) and TAB (stick-table)arguments, the proxy must always
be specified. Otherwise an error is reported.
This patch is mandatory to support TCP/HTTP rules in defaults sections.
This change is required to support TCP/HTTP rules in defaults sections. The
'disabled' bitfield in the proxy structure, used to know if a proxy is
disabled or stopped, is replaced a generic bitfield named 'flags'.
PR_DISABLED and PR_STOPPED flags are renamed to PR_FL_DISABLED and
PR_FL_STOPPED respectively. In addition, everywhere there is a test to know
if a proxy is disabled or stopped, there is now a bitwise AND operation on
PR_FL_DISABLED and/or PR_FL_STOPPED flags.
.disabled field in the proxy structure is documented to be a bitfield. So
use it as a bitfield. This change was introduced to the 2.5, by commit
8e765b86f ("MINOR: proxy: disabled takes a stopping and a disabled state").
No backport is needed except if the above commit is backported.
The test on the return value of fix_tag_value() function was inverted. To
wait for more data, the return value must be a valid empty string and not
IST_NULL.
This patch must be backported to 2.4.
http-after-response rules evaluation must be stopped after a "allow". It
means the frontend ruleset must not be evaluated if a "allow" was performed
in the backend ruleset. Internally, the evaluation must be stopped if on
HTTP_RULE_RES_STOP return value. Only the "allow" action is concerned by
this change.
Thanks to this patch, http-response and http-after-response behave in the
same way.
This patch should be backported as far as 2.2.
This is the same as for commit 468c000db ("BUG/MEDIUM: jwt: fix base64
decoding error detection"), but for function sample_conv_jwt_member_query()
that is used by sample converters jwt_header_query() and jwt_payload_query().
Thanks to Tim for the report. No backport is needed.
As Tim reported in github issue #1414, we ought to use a constant-time
memcmp() when comparing hashes to avoid time-based attacks. Let's use
CRYPTO_memcmp() since this code already depends on openssl.
No backport is needed, this was just merged into 2.5.
Tim reported that a decoding error from the base64 function wouldn't
be matched in case of bad input, and could possibly cause trouble
with -1 being passed in decoded_sig->data. In the case of HMAC+SHA
it is harmless as the comparison is made using memcmp() after checking
for length equality, but in the case of RSA/ECDSA this result is passed
as a size_t to EVP_DigetVerifyFinal() and may depend on the lib's mood.
The fix simply consists in checking the intermediary result before
storing it.
That's precisely what happens with one of the regtests which returned
0 instead of 4 on the intentionally defective token, so the regtest
was fixed as well.
No backport is needed as this is new in this release.
A bug was introduced by commit previous bf9498a31 ("MINOR: resolvers:
fix the resolv_str_to_dn_label() API about trailing zero") as the code
is particularly contrived and hard to test. The output writes the last
char at [i+1] so the trailing zero and return value must be at i+1.
This will have to be backported where the patch above is backported
since it was needed for a fix.
These two fields are exclusive as they depend on the data type.
Let's move them into a union to save some precious bytes. This
reduces the struct resolv_answer_item size from 600 to 576 bytes.
The struct resolv_answer_item contains an address field of type
"sockaddr" which is only 16 bytes long, but which is used to store
either IPv4 or IPv6. Fortunately, the contents only overlap with
the "target" field that follows it and that is large enough to
absorb the extra bytes needed to store AAAA records. But this is
dangerous as just moving fields around could result in memory
corruption.
The fix uses a union and removes the casts that were used to hide
the problem.
Older versions need to be checked and possibly fixed. This needs
to be backported anyway.
In multi-threaded mode, on operating systems supporting multiple listeners on
the same IP:port, this will automatically create this number of multiple
identical listeners for the same line, all bound to a fair share of the number
of the threads attached to this listener. This can sometimes be useful when
using very large thread counts where the in-kernel locking on a single socket
starts to cause a significant overhead. In this case the incoming traffic is
distributed over multiple sockets and the contention is reduced. Note that
doing this can easily increase the CPU usage by making more threads work a
little bit.
If the number of shards is higher than the number of available threads, it
will automatically be trimmed to the number of threads. A special value
"by-thread" will automatically assign one shard per thread.
This function's purpose will be to duplicate a listener in INIT state.
This will be used to ease declaration of listeners spanning multiple
groups, which will thus require multiple FDs hence multiple receivers.
With groups at some point we'll have to have distinct masks/groups in the
receiver and the bind_conf, because a single bind_conf might require to
instantiate multiple receivers (one per group).
Let's split the thread mask and group to have one for the bind_conf and
another one for the receiver while it remains easy to do. This will later
allow to use different storage for the bind_conf if needed (e.g. support
multiple groups).
This function suffers from the same API issue as its sibling that does the
opposite direction, it demands that the input string is zero-terminated
*and* that its length *including* the trailing zero is passed on input,
forcing callers to pass length + 1, and itself to use that length - 1
everywhere internally.
This patch addressess this. There is a single caller, which is the
location of the previous bug, so it should probably be backported at
least to keep the code consistent across versions. Note that the
function is called dns_dn_label_to_str() in 2.3 and earlier.
An off-by-one issue in buffer size calculation used to limit the output
of resolv_dn_label_to_str() to 254 instead of 255.
This must be backported to 2.0.
In issue #1411, @jjiang-stripe reports that do-resolve() sometimes seems
to be trying to resolve crap from random memory contents.
The issue is that action_prepare_for_resolution() tries to measure the
input string by itself using strlen(), while resolv_action_do_resolve()
directly passes it a pointer to the sample, omitting the known length.
Thus of course any other header present after the host in memory are
appended to the host value. It could theoretically crash if really
unlucky, with a buffer that does not contain any zero including in the
index at the end, and if the HTX buffer ends on an allocation boundary.
In practice it should be too low a probability to have ever been observed.
This patch modifies the action_prepare_for_resolution() function to take
the string length on with the host name on input and pass that down the
chain. This should be backported to 2.0 along with commit "MINOR:
resolvers: fix the resolv_str_to_dn_label() API about trailing zero".
This function is bogus at the API level: it demands that the input string
is zero-terminated *and* that its length *including* the trailing zero is
passed on input. While that already looks smelly, the trailing zero is
copied as-is, and is then explicitly replaced with a zero... Not only
all callers have to pass hostname_len+1 everywhere to work around this
absurdity, but this requirement causes a bug in the do-resolve() action
that passes random string lengths on input, and that will be fixed on a
subsequent patch.
Let's fix this API issue for now.
This patch will have to be backported, and in versions 2.3 and older,
the function is in dns.c and is called dns_str_to_dn_label().
Some protocols fail with "error blah [ip:port]" and other fail with
"[ip:port] error blah". All this already appears in a "starting" or
"binding" context after a proxy name. Let's choose a more universal
approach like below where the ip:port remains at the end of the line
prefixed with "for".
[WARNING] (18632) : Binding [binderr.cfg:10] for proxy http: cannot bind receiver to device 'eth2' (No such device) for [0.0.0.0:1080]
[WARNING] (18632) : Starting [binderr.cfg:10] for proxy http: cannot set MSS to 12 for [0.0.0.0:1080]
Binding errors and late socket errors provide no information about
the file and line where the problem occurs. These are all done by
protocol_bind_all() and they only report "Starting proxy blah". Let's
change this a little bit so that:
- the file name and line number of the faulty bind line is alwas mentioned
- early binding errors are indicated with "Binding" instead of "Starting".
Now we can for example have this:
[WARNING] (18580) : Binding [binderr.cfg:10] for proxy http: cannot bind receiver to device 'eth2' (No such device) [0.0.0.0:1080]
The MSS errors are the only ones not indicating what was attempted, let's
report the value that was tried, as it can help users spot them in the
config (particularly if a default value was used).
Right now only the last warning or error is reported from
tcp_bind_listener(), but it is useful to report all warnings and no only
the last one, so we now emit them delimited by commas. Previously we used
a fixed buffer of 100 bytes, which was too small to store more than one
message, so let's extend it.
Signed-off-by: Bjoern Jacke <bjacke@samba.org>
This new converter takes a JSON Web Token, an algorithm (among the ones
specified for JWS tokens in RFC 7518) and a public key or a secret, and
it returns a verdict about the signature contained in the token. It does
not simply return a boolean because some specific error cases cas be
specified by returning an integer instead, such as unmanaged algorithms
or invalid tokens. This enables to distinguich malformed tokens from
tampered ones, that would be valid format-wise but would have a bad
signature.
This converter does not perform a full JWT validation as decribed in
section 7.2 of RFC 7519. For instance it does not ensure that the header
and payload parts of the token are completely valid JSON objects because
it would need a complete JSON parser. It only focuses on the signature
and checks that it matches the token's contents.
Those converters allow to extract a JSON value out of a JSON Web Token's
header part or payload part (the two first dot-separated base64url
encoded parts of a JWS in the Compact Serialization format).
They act as a json_query call on the corresponding decoded subpart when
given parameters, and they return the decoded JSON subpart when no
parameter is given.
A JWT signed with the RSXXX or ESXXX algorithm (RSA or ECDSA) requires a
public certificate to be verified and to ensure it is valid. Those
certificates must not be read on disk at runtime so we need a caching
mechanism into which those certificates will be loaded during init.
This is done through a dedicated ebtree that is filled during
configuration parsing. The path to the public certificates will need to
be explicitely mentioned in the configuration so that certificates can
be loaded as early as possible.
This tree is different from the ckch one because ckch entries are much
bigger than the public certificates used in JWT validation process.
This helper function splits a JWT under Compact Serialization format
(dot-separated base64-url encoded strings) into its different sub
strings. Since we do not want to manage more than JWS for now, which can
only have at most three subparts, any JWT that has strictly more than
two dots is considered invalid.
The full list of possible algorithms used to create a JWS signature is
defined in section 3.1 of RFC7518. This patch adds a helper function
that converts the "alg" strings into an enum member.
This fetch can be used to retrieve the data contained in an HTTP
Authorization header when the Bearer scheme is used. This is used when
transmitting JSON Web Tokens for instance.
On receiving CONNECTION_CLOSE frame, the mux is flagged for immediate
connection close. A stream is closed even if there is data not ACKed
left if CONNECTION_CLOSE has been received.
The mux tx buffers have been rewritten with buffers attached to qcs
instances. qc_buf_available and qc_get_buf functions are updated to
manipulates qcs. All occurences of the unused qcc ring buffer are
removed to ease the code maintenance.
Defer the shutting of a qcs if there is still data in its tx buffers. In
this case, the conn_stream is closed but the qcs is kept with a new flag
QC_SF_DETACH.
On ACK reception, the xprt wake up the shut_tl tasklet if the stream is
flagged with QC_SF_DETACH. This tasklet is responsible to free the qcs
and possibly the qcc when all bidirectional streams are removed.
For the moment, a quic connection is considered dead if it has no
bidirectional streams left on it. This test is implemented via
qcc_is_dead function. It can be reused to properly close the connection
when needed.
Properly handle tx buffers management in h3 data sending. If there is
not enough contiguous space, the buffer is first realigned. If this is
not enough, the stream is flagged with QC_SF_BLK_MROOM waiting for the
buffer to be emptied.
If a frame on a stream is successfully pushed for sending, the stream is
called if it was flagged with QC_SF_BLK_MROOM.
Remove the tx mux ring buffers in qcs, which should be in the qcc. For
the moment, use a simple architecture with 2 simple tx buffers in the
qcs.
The first buffer is used by the h3 layer to prepare the data. The mux
send operation transfer these into the 2nd buffer named xprt_buf. This
buffer is only freed when an ACK has been received.
This architecture is functional but not optimal for two reasons :
- it won't limit the buffer usage by connection
- each transfer on a new stream requires an allocation
This new ssllib_name_startswith precondition check can be used to
distinguish application linked with OpenSSL from the ones linked with
other SSL libraries (LibreSSL or BoringSSL namely). This check takes a
string as input and returns 1 when the SSL library's name starts with
the given string. It is based on the OpenSSL_version function which
returns the same output as the "openssl version" command.
Set an `lua_atpanic()` handler before calling `hlua_prepend_path()` in
`hlua_config_prepend_path()`.
This prevents the process from abort()ing when `hlua_prepend_path()` fails
for some reason.
see GitHub Issue #1409
This is a very minor issue that can't happen in practice. No backport needed.
This line is not related to the response channel but to the stream. Thus it
must be indented at the same level as stream-interfaces, connections,
channels...
Filters can block the stream on pre/post analysis for any reason and it can
be useful to report it in "show sess all". So now, a "current_filter" extra
line is reported for each channel if a filter is blocking the analysis. Note
that this does not catch the TCP/HTTP payload analysis because all
registered filters are always evaluated when more data are received.
Sometimes an HTTP or TCP rule may take time to complete because it is
waiting for external data (e.g. "wait-for-body", "do-resolve"), and it
can be useful to report the action and the location of that rule in
"show sess all". Here for streams blocked on such a rule, there will
now be a "current_line" extra line reporting this. Note that this does
not catch rulesets which are re-evaluated from the start on each change
(e.g. tcp-request content waiting for changes) but only when a specific
rule is being paused.
These ones are passed on rule creation for the sole purpose of being
reported in "show sess", which is not done yet. For now the entries
are allocated upon rule creation and freed in free_act_rules().
Rules are currently allocated using calloc() by their caller, which does
not make it very convenient to pass more information such as the file
name and line number.
This patch introduces new_act_rule() which performs the malloc() and
already takes in argument the ruleset (ACT_F_*), the file name and the
line number. This saves the caller from having to assing ->from, and
will allow to improve the internal storage with more info.
There have been a large number of issues reported with conn_cur
synchronization because the concept is wrong. In an active-passive
setup, pushing the local connections count from the active node to
the passive one will result in the passive node to have a higher
counter than the real number of connections. Due to this, after a
switchover, it will never be able to close enough connections to
go down to zero. The same commonly happens on reloads since the new
process preloads its values from the old process, and if no connection
happens for a key after the value is learned, it is impossible to reset
the previous ones. In active-active setups it's a bit different, as the
number of connections reflects the number on the peer that pushed last.
This patch solves this by marking the "conn_cur" local and preventing
it from being learned from peers. It is still pushed, however, so that
any monitoring system that collects values from the peers will still
see it.
The patch is tiny and trivially backportable. While a change of behavior
in stable branches is never welcome, it remains possible to fix issues
if reports become frequent.
In the configuration sometimes we'll omit a thread group number to designate
a global thread number range, and sometimes we'll mention the group and
designate IDs within that group. The operation is more complex than it
seems due to the need to check for ranges spanning between multiple groups
and determining groups from threads from bit masks and remapping bit masks
between local/global.
This patch adds a function to perform this operation, it takes a group and
mask on input and updates them on output. It's designed to be used by "bind"
lines but will likely be usable at other places if needed.
For situations where specified threads do not exist in the group, we have
the choice in the code between silently fixing the thread set or failing
with a message. For now the better option seems to return an error, but if
it turns out to be an issue we can easily change that in the future. Note
that it should only happen with "x/even" when group x only has one thread.
This extends the "thread" statement of bind lines to support an optional
thread group number. When unspecified (0) it's an absolute thread range,
and when specified it's one relative to the thread group. Masks are still
used so no more than 64 threads may be specified at once, and a single
group is possible. The directive is not used for now.
Now thread dumps will report the thread group number and the ID within
this group. Note that this is still quite limited because some masks
are calculated based on the thread in argument while they have to be
performed against a group-level thread ID.
This is the equivalent of "tid" for ease of access. In the future if we
make th_cfg a pure thread-local array (not a pointer), it may make sense
to move it there.
ha_set_tid() was randomly used either to explicitly set thread 0 or to
set any possibly incomplete thread during boot. Let's replace it with
a pointer to a valid thread or NULL for any thread. This allows us to
check that the designated threads are always valid, and to ignore the
thread 0's mapping when setting it to NULL, and always use group 0 with
it during boot.
The initialization code is also cleaner, as we don't pass ugly casts
of a thread ID to a pointer anymore.
This will be a convenient way to communicate the thread ID and its
local ID in the group, as well as their respective bits when creating
the threads or when only a pointer is given.
This will ease the reporting of the current thread group ID when coming
from the thread itself, especially since it returns the visible ID,
starting at 1.
This takes care of unassigned threads groups and places unassigned
threads there, in a more or less balanced way. Too sparse allocations
may still fail though. For now with a maximum group number fixed to 1
nothing can really fail.
This registers a mapping of threads to groups by enumerating for each thread
what group it belongs to, and marking the group as assigned. It takes care of
checking for redefinitions, overlaps, and holes. It supports both individual
numbers and ranges. The thread group is referenced from the thread config.
This creates a struct tgroup_info which knows the thread ID of the first
thread in a group, and the number of threads in it. For now there's only
one thread group supported in the configuration, but it may be forced to
other values for development purposes by defining MAX_TGROUPS, and it's
enabled even when threads are disabled and will need to remain accessible
during boot to keep a simple enough internal API.
For the purpose of easing the configurations which do not specify a thread
group, we're starting group numbering at 1 so that thread group 0 can be
"undefined" (i.e. for "bind" lines or when binding tasks).
The goal will be to later move there some global items that must be
made per-group.
We want to make sure that the current thread_info accessed via "ti" will
remain constant, so that we don't accidentally place new variable parts
there and so that the compiler knows that info retrieved from there is
not expected to have changed between two function calls.
Only a few init locations had to be adjusted to use the array and the
rest is unaffected.
The last 3 fields were 3 list heads that are per-thread, and which are:
- the pool's LRU head
- the buffer_wq
- the streams list head
Moving them into thread_ctx completes the removal of dynamic elements
from the struct thread_info. Now all these dynamic elements are packed
together at a single place for a thread.
The TI_FL_STUCK flag is manipulated by the watchdog and scheduler
and describes the apparent life/death of a thread so it changes
all the time and it makes sense to move it to the thread's context
for an active thread.
The "thread_info" name was initially chosen to store all info about
threads but since we now have a separate per-thread context, there is
no point keeping some of its elements in the thread_info struct.
As such, this patch moves prev_cpu_time, prev_mono_time and idle_pct to
thread_ctx, into the thread context, with the scheduler parts. Instead
of accessing them via "ti->" we now access them via "th_ctx->", which
makes more sense as they're totally dynamic, and will be required for
future evolutions. There's no room problem for now, the structure still
has 84 bytes available at the end.
The scheduler contains a lot of stuff that is thread-local and not
exclusively tied to the scheduler. Other parts (namely thread_info)
contain similar thread-local context that ought to be merged with
it but that is even less related to the scheduler. However moving
more data into this structure isn't possible since task.h is high
level and cannot be included everywhere (e.g. activity) without
causing include loops.
In the end, it appears that the task_per_thread represents most of
the per-thread context defined with generic types and should simply
move to tinfo.h so that everyone can use them.
The struct was renamed to thread_ctx and the variable "sched" was
renamed to "th_ctx". "sched" used to be initialized manually from
run_thread_poll_loop(), now it's initialized by ha_set_tid() just
like ti, tid, tid_bit.
The memset() in init_task() was removed in favor of a bss initialization
of the array, so that other subsystems can put their stuff in this array.
Since the tasklet array has TL_CLASSES elements, the TL_* definitions
was moved there as well, but it's not a problem.
The vast majority of the change in this patch is caused by the
renaming of the structures.
We used to remap SI_TKILL to SI_LWP when SI_TKILL was not available
(e.g. FreeBSD) but that's ugly and since we need this only in a single
switch/case block in wdt.c it's even simpler and cleaner to perform the
two tests there, so let's do this.
The watchdog timer had no more reason for being shared with the struct
thread_info since the watchdog is the only user now. Let's remove it
from the struct and move it to a static array in wdt.c. This removes
some ifdefs and the need for the ugly mapping to empty_t that might be
subject to a cast to a long when compared to TIMER_INVALID. Now timer_t
is not known outside of wdt.c and clock.c anymore.
This removes the knowledge of clockid_t from anywhere but clock.c, thus
eliminating a source of includes burden. The unused clock_id field was
removed from thread_info, and the definition setting of clockid_t was
removed from compat.h. The most visible change is that the function
now_cpu_time_thread() now takes the thread number instead of a tinfo
pointer.
The code that deals with timer creation for the WDT was moved to clock.c
and is called with the few relevant arguments. This removes the need for
awareness of clock_id from wdt.c and as such saves us from having to
share it outside. The timer_t is also known only from both ends but not
from the public API so that we don't have to create a fake timer_t
anymore on systems which do not support it (e.g. macos).
This was previously open-coded in run_thread_poll_loop(). Now that
we have clock.c dedicated to such stuff, let's move the code there
so that we don't need to keep such ifdefs nor to depend on the
clock_id.
Instead of fiddling with before_poll and after_poll in
activity_count_runtime(), the function is now called by
clock_entering_poll() which passes it the number of microseconds
spent working. This allows to remove all calls to
activity_count_runtime() from the pollers.
The entering_poll/leaving_poll/measure_idle functions that were hard
to classify and used to move to various locations have now been placed
into clock.c since it's precisely about time-keeping. The functions
were renamed to clock_*. The samp_time and idle_time values are now
static since there is no reason for them to be read from outside.
There is currently a problem related to time keeping. We're mixing
the functions to perform calculations with the os-dependent code
needed to retrieve and adjust the local time.
This patch extracts from time.{c,h} the parts that are solely dedicated
to time keeping. These are the "now" or "before_poll" variables for
example, as well as the various now_*() functions that make use of
gettimeofday() and clock_gettime() to retrieve the current time.
The "tv_*" functions moved there were also more appropriately renamed
to "clock_*".
Other parts used to compute stolen time are in other files, they will
have to be picked next.
It was brought by a variable declared after some statements in commit
21185970c ("MINOR: proc: setting the process to produce a core dump on
FreeBSD."). It's worth noting that some versions of clang seem to ignore
-Wdeclaration-after-statement by default. No backport is needed.
Remove unused code in mux-quic. This is mostly code related to the
backend side. This code is untested for the moment, its removal will
simplify the code maintenance.
Remove an unneeded strdup invocation during QPACK huffman decoding. A
temporary storage buffer is passed by the function and exists after
decoding so no need to duplicate memory here.
We've found others places where the read0 is ignored because of an
incomplete frame parsing. This time, it happens during parsing of
CONTINUATION frames.
When frames are parsed, incomplete frames are properly handled and
H2_CF_DEM_SHORT_READ flag is set. It is also true for HEADERS
frames. However, for CONTINUATION frames, there is an exception. Besides
parsing the current frame, we try to peek header of the next one to merge
payload of both frames, the current one and the next one. Idea is to create
a sole HEADERS frame before parsing the payload. However, in this case, it
is possible to have an incomplete frame too, not the current one but the
next one. From the demux point of view, the current frame is complete. We
must go to the internal function h2c_decode_headers() to detect an
incomplete frame. And this case was not identified and fixed when
H2_CF_DEM_SHORT_READ flag was introduced in the commit b5f7b5296
("BUG/MEDIUM: mux-h2: Handle remaining read0 cases on partial frames")
This bug was reported in a comment of the issue #1362. The patch must be
backported as far as 2.0.
Remove the quic_conn from the receiver connection_ids tree on
quic_conn_free. This fixes a crash due to dangling references in the
tree after a quic connection release.
This operation must be conducted under the listener lock. For this
reason, the quic_conn now contains a reference to its attached listener.
Use the count of bidirectional streams to call qc_release in qc_detach.
We cannot inspect the by_id tree because uni-streams are never removed
from it. This allows the connection to be properly freed.
It is required that all qcs streams are in the by_id tree for the xprt
to function correctly. Without this, some ACKs are not properly emitted
by xprt.
Note that this change breaks the free of the connection because the
condition eb_is_empty in qc_detach is always true. This will be fixed in
a following patch.
It seems it was a bad idea to use the same function as for TCP ssl sockets
to initialize the SSL session objects for QUIC with ssl_bio_and_sess_init().
Indeed, this had as very bad side effects to generate SSL errors due
to the fact that such BIOs initialized for QUIC could not finally be controlled
via the BIO_ctrl*() API, especially BIO_ctrl() function used by very much other
internal OpenSSL functions (BIO_push(), BIO_pop() etc).
Others OpenSSL base QUIC implementation do not use at all BIOs to configure
QUIC connections. So, we decided to proceed the same way as ngtcp2 for instance:
only initialize an SSL object and call SSL_set_quic_method() to set its
underlying method. Note that calling this function silently disable this option:
SSL_OP_ENABLE_MIDDLEBOX_COMPAT.
We implement qc_ssl_sess_init() to initialize SSL sessions for QUIC connections
to do so with a retry in case of allocation failure as this is done by
ssl_bio_and_sess_init(). We also modify the code part for haproxy servers.
The "show pools" command provides some "allocated" and "used" estimates
on the pools objects, but this applies to the shared pool and the "used"
includes what is currently assigned to thread-local caches. It's possible
to know how much each thread uses, so let's dump the total size allocated
by thread caches as an estimate. It's only done when pools are enabled,
which explains why the patch adds quite a lot of ifdefs.
These ones are rarely used or only to waste CPU cycles waiting, and are
the last ones requiring system includes in thread.h. Let's uninline them
and move them to thread.c.
This removes the thread identifiers from struct thread_info and moves
them only in static array in thread.c since it's now the only file that
needs to touch it. It's also the only file that needs to include
pthread.h, beyond haproxy.c which needs it to start the poll loop. As
a result, much less system includes are needed and the LoC reduced by
around 3%.
haproxy.c still has to deal with pthread-specific low-level stuff that
is OS-dependent. We should not have to deal with this there, and we do
not need to access pthread anywhere else.
Let's move these 3 functions to thread.c and keep empty inline ones for
when threads are disabled.
It's not needed to inline it at all (one call per loop) and it introduces
dependencies, let's move it to fd.c.
Removing the few remaining includes that came with it further reduced
by ~0.2% the LoC and the build time is now below 6s.
It's pointless to inline this, it's called exactly once per poll loop,
and it depends on time.h which is quite deep. Let's move that to task.c
along with sched_report_idle().
The remaining large functions are those allocating/initializing and
occasionally freeing connections, conn_streams and sockaddr. Let's
move them to connection.c. In fact, cs_free() is the only one-liner
but let's move it along with the other ones since a call will be
small compared to the rest of the work done there.
The following inlined functions are particularly large (and probably not
inlined at all by the compiler), and together represent roughly half of
the file, while they're used at most once per connection. They were moved
to connection.c.
conn_upgrade_mux_fe, conn_install_mux_fe, conn_install_mux_be,
conn_install_mux_chk, conn_delete_from_tree, conn_init, conn_new,
conn_free
The following functions are quite heavy and have no reason to be kept
inlined:
srv_release_conn, srv_lookup_conn, srv_lookup_conn_next,
srv_add_to_idle_list
They were moved to server.c. It's worth noting that they're a bit
at the edge between server and connection and that maybe we could
create an idle-conn file for these in the near future.
We do not really need to have them inlined, and having xxhash.h included
by connection.h results in this 4700-lines file being processed 101 times
over the whole project, which accounts for 13.5% of the total size!
Additionally, half of the functions are only needed from connection.c.
Let's move the functions there and get rid of the painful include.
The build time is now down to 6.2s just due to this.
The hash type stored everywhere is XXH64_hash_t, which annoyingly forces
everyone to include the huge xxhash file. We know it's an uint64_t because
that's its purpose and the type is only made to abstract it on machines
where uint64_t is not availble. Let's switch the type to uint64_t
everywhere and avoid including xxhash from the type file.
This one is expensive in code size because it comes with xxhash.h at a
low level of dependency that's inherited at plenty of places, and for
a function does doesn't benefit from inlining and could possibly even
benefit from not being inline given that it's large and called from the
scheduler.
Moving it to activity.c reduces the LoC by 1.2% and the binary size by
~1kB.
This function has no reason for being inlined, it's called from non
critical places (once in pollers), is quite large and comes with
dependencies (time and freq_ctr). Let's move it to acitvity.c. That's
another 0.4% less LoC to build.
The idle time calculation stuff was moved to task.h by commit 6dfab112e
("REORG: sched: move idle time calculation from time.h to task.h") but
these two variables that are only maintained by task.{c,h} were still
left in time.{c,h}. They have to move as well.
These ones require openssl and are only built when it's enabled. There's
no point keeping them in sample.c when ssl_sample.c already deals with this
and the required includes. This also allows to remove openssl-compat.h
from sample.c and to further reduce the number of inclusions of openssl
includes, and the build time is now down to under 8 seconds.
These two counters were the only ones not in the global struct, while
the SSL freq counters or the req counts are already in it, this forces
stats.c to include ssl_sock just to know about them. Let's move them
over there with their friends. This reduces from 408 to 384 the number
of includes of opensslconf.h.
This one has nothing to do with ssl_sock as it manipulates the struct
server only. Let's move it to server.c and remove unneeded dependencies
on ssl_sock.h. This further reduces by 10% the number of includes of
opensslconf.h and by 0.5% the number of compiled lines.