Timeouts for dynamic resolutions are not handled at the stream level but by
the resolvers themselves. This means there are no connect, client or server
timeouts defined on the internal proxy used by a resolver.
While it is not an issue for DNS resolution over UDP, it can be a problem
for resolution over TCP. New sessions are automatically created when
required, and killed on excess. But only established connections are
considered. Connecting ones are never killed. Because there is no connect
timeout, we rely on the kernel to report a connection error. And this may be
quite long.
Because resolutions are periodically triggered, this may lead to an excess
of unusable sessions in connecting state. This also prevents HAProxy from
quickly exiting on soft-stop. This is annoying, especially because there is
no reason not to set a connect timeout.
So to mitigate the issue, we now use the "resolve" timeout as connect
timeout for the internal proxy attached to a resolver.
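A minimal sketch of the idea (the member names below are illustrative and not
necessarily the exact fields of the resolvers/proxy structs):
    /* hedged sketch: reuse the section's "resolve" timeout as the connect
     * timeout of the internal proxy attached to the resolver */
    resolvers->px->timeout.connect = resolvers->timeout.resolve;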
This patch should be backported as far as 2.4.
Thanks to the previous commit ("BUG/MEDIUM: dns: Kill idle DNS sessions during
stopping stage"), DNS idle sessions are killed during the stopping stage. But
the task responsible for killing these sessions runs every 5 seconds. This
means that, when HAProxy is stopped, we can observe a delay before the process
exits.
To reduce this delay, when the resolvers task is executed, all DNS idle
tasks are woken up.
This patch must be backported as far as 2.6.
There is no server timeout for DNS sessions over TCP. It means an idle session
cannot be killed on its own. There is a task running periodically, every 5s,
to kill the excess of idle sessions. But the last one is never
killed. During the stopping stage, it is an issue since the dynamic
resolutions are no longer performed (2ec6f14c "BUG/MEDIUM: resolvers:
Properly stop server resolutions on soft-stop").
Before the above commit, during the stopping stage, the DNS sessions were
killed when a resolution was triggered. Now, nothing kills these sessions.
This prevents the process from exiting on soft-stop.
To fix this bug, the task killing the excess of idle sessions now kills all
idle sessions during the stopping stage.
This patch must be backported as far as 2.6.
When the log applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
from exiting, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the stats applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
from exiting, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the http-client applet is executed while a shut is pending, the
remaining output data must always be consumed. Otherwise, this can prevent
the stream from exiting, leading to a spinning loop on the applet.
It is 2.8-specific. No backport needed.
When the cli applet is executed while a shut is pending, the remaining
output data must always be consumed. Otherwise, this can prevent the stream
from exiting, leading to a spinning loop on the applet.
This patch should fix the issue #2107. It is 2.8-specific. No backport
needed.
An applet must never set SE_FL_EOS flag without SE_FL_EOI or SE_FL_ERROR
flags. Here, SE_FL_EOI flag was missing for "quit" or "_getsocks"
commands. Indeed, these commands are terminal.
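As a rough sketch of what the CLI handler should do for such terminal commands
(using the 2.8 SE descriptor flags; the exact call site is not shown):
    /* hedged sketch: a terminal command ends both the input and the stream */
    se_fl_set(appctx->sedesc, SE_FL_EOI);
    se_fl_set(appctx->sedesc, SE_FL_EOS);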
This bug triggers a BUG_ON() recently added.
This patch is related to the issue #2107. It is 2.8-specific. No backport
needed.
When calls to strcpy() were replaced with calls to strlcpy2(), one of them
was replaced incorrectly: the source and size arguments were swapped. Correct
that.
This should fix issue #2110.
These ones are generally harmless on modern compilers because the
compiler checks them. While gcc optimizes them away without even
referencing strcpy(), clang prefers to call strcpy(). Nevertheless they
prevent stricter checks from being enabled, so it is better to remove them
altogether. They were all replaced with strlcpy2(), using the size of the
destination, which is always known there.
strcpy() is quite nasty but tolerable for copying constants, but here
it copies a variable path into a node in a code path that's not
trivial to follow given that it takes the node as the result of
a tree lookup. Let's get rid of it and mention where the entry
is retrieved.
As every time strncat() is used, it's wrong, and this one is no exception.
Users often think that the length applies to the destination, but it actually
applies to the source, which makes it hard to use correctly. The bug did not
have an impact because the length was preallocated from the sum of all
the individual lengths as measured by strlen() so there was no chance one
of them would change in between. But it could change in the future. Let's
fix it to use memcpy() instead for strings, or byte copies for delimiters.
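To illustrate the pitfall with a standalone sketch (not the actual HAProxy
code): strncat()'s size argument bounds how many bytes are appended from the
source, not the size of the destination, so the safer pattern used here is to
pre-size the destination and copy with memcpy():
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(void)
    {
        const char *parts[] = { "foo", "bar", "baz" };
        size_t total = 0;

        /* pre-compute the total length, as the fixed code does */
        for (int i = 0; i < 3; i++)
            total += strlen(parts[i]) + 1;   /* +1 for the ',' or the final '\0' */

        char *out = malloc(total);
        if (!out)
            return 1;

        char *p = out;
        for (int i = 0; i < 3; i++) {
            if (i)
                *p++ = ',';                           /* byte copy for the delimiter */
            memcpy(p, parts[i], strlen(parts[i]));    /* bounded copy of the string */
            p += strlen(parts[i]);
        }
        *p = '\0';

        printf("%s\n", out);                          /* prints foo,bar,baz */
        free(out);
        return 0;
    }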
No backport is needed, though it can be done if it helps to apply other
fixes.
Add code so that compression can be used for requests as well.
New compression keywords are introduced:
"direction" specifies what we want to compress. Valid values are
"request", "response", or "both".
"type-req" and "type-res" define the content-types to be compressed for
requests and responses, respectively. "type" is kept as an alias for
"type-res" for backward compatibility.
"algo-req" specifies the compression algorithm to be used for requests.
Only one algorithm can be provided.
"algo-res" provides the list of algorithms that can be used to compress
responses. "algo" is kept as an alias for "algo-res" for backward
compatibility.
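A hypothetical configuration snippet combining these keywords (algorithm and
content-type values are only illustrative):
    compression direction both
    compression algo-req gzip
    compression algo-res gzip deflate
    compression type-req application/json
    compression type-res text/html text/plain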
Make provision for being able to store both compression algorithms and
content-types to compress for both requests and responses. For now, only
the response ones are used.
Make provision for storing the compression algorithm and the compression
context twice, one for requests, and the other for responses. Only the
response ones are used for now.
When a trace message for an applet is dumped, if the SC exists, the stream
always exists too. There is no way to attach an applet to a health-check.
So, we can use the unsafe version __sc_strm() to get the stream.
This patch is related to #2106. Not sure it will be enough for
Coverity. However, there is no bug here.
On startup/reload, startup_logs_init() will try to export the startup logs shm
file descriptor through the internal HAPROXY_STARTUPLOGS_FD env variable.
While memprintf() is used to prepare the string to be exported via
setenv(), the str_fd argument (first argument passed to memprintf()) could
be non-NULL as a result of the HAPROXY_STARTUPLOGS_FD env variable being
already set.
Indeed: str_fd is already used earlier in the function to store the result
of getenv("HAPROXY_STARTUPLOGS_FD").
The issue here is that memprintf() is designed to free the 'out' argument
if out != NULL, but here str_fd must not be freed since it was
provided by getenv(): freeing it would result in a memory violation.
To prevent any invalid free, we must ensure that str_fd is set to NULL
prior to calling memprintf().
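A standalone sketch of the pattern (with a simplified stand-in for
memprintf(); the real one lives in HAProxy and frees its first argument when
it is non-NULL):
    #include <stdarg.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* simplified stand-in for HAProxy's memprintf(): frees *out if set, then
     * stores a freshly allocated formatted string into it */
    static char *memprintf(char **out, const char *fmt, ...)
    {
        va_list args;
        char buf[64];

        va_start(args, fmt);
        vsnprintf(buf, sizeof(buf), fmt, args);
        va_end(args);

        free(*out);            /* invalid if *out points into getenv() storage! */
        *out = strdup(buf);
        return *out;
    }

    int main(void)
    {
        char *str_fd = getenv("HAPROXY_STARTUPLOGS_FD");   /* may be non-NULL */
        int fd = 4;                                        /* illustrative value */

        /* the fix: never let a getenv() pointer reach memprintf()'s out argument */
        str_fd = NULL;
        memprintf(&str_fd, "%d", fd);
        setenv("HAPROXY_STARTUPLOGS_FD", str_fd, 1);
        free(str_fd);
        return 0;
    }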
This must be backported in 2.7 with eba6a54cd4 ("MINOR: logs: startup-logs
can use a shm for logging the reload")
People who use HAProxy as process 1 in containers sometimes start
other things from the program section. This is still not recommended as
the master process has minimal features regarding process management.
Environment variables are still inherited, even internal ones.
Since 2.7, it could provoke a crash when inheriting the
HAPROXY_STARTUPLOGS_FD variable.
Note: for future releases it would be better to clean the env and set
a list of variables to be exported. We first need to determine which
variables are used by users.
Must be backported in 2.7.
Previously, ODCIDs were concatenated with the client address. This was
done to prevent a collision between two endpoints which used the same
ODCID.
Thanks to the two previous patches, the first CID generated for a connection
is now directly derived from the client ODCID using a hash function which
uses the client source address for the same purpose. Thus, it is now
unneeded to concatenate the client address to the <odcid> quic-conn member.
This change allows simplifying the quic_cid structure management and
reducing its size, which is important as it is embedded several times in
various structures such as quic_conn and quic_rx_packet.
This should be backported up to 2.7.
First connection CID generation has been altered. It is now directly
derived from the client ODCID since the previous commit:
commit 162baaff7a
MINOR: quic: derive first DCID from client ODCID
This patch removes the ODCID tree which is now unneeded. On connection
lookup via CID, if a DCID is not found the hash derivation is performed
for an INITIAL/0-RTT packet only. In case a client has used an ODCID multiple
times, this will allow retrieving our generated DCID in the
CID tree without storing the ODCID node.
The impact of these two combined patches is that they may slightly improve
haproxy's memory footprint by removing a tree node from the quic_conn
structure. The cpu cost induced by hash derivation should only be
paid a few times per connection as the client will start to
use our generated CID as soon as it receives it.
This should be backported up to 2.7.
Change the generation of the first CID of a connection. It is directly
derived from the client ODCID using a 64-bits hash function. Client
address is added to avoid collision between clients which could use the
same ODCID.
For the moment, this change has no functional impact. However, it will be
directly used by the next commit to be able to remove the ODCID tree.
This should be backported up to 2.7.
This is due to this commit:
MINOR: quic: Add trace to debug idle timer task issues
where traces were added without having been tested at developer level.
<qc> was dereferenced after having been released by qc_conn_release().
Set qc to NULL after it has been released to forbid its dereferencing.
Add a check for qc->idle_timer_task in the traces added by the mentioned
commit above to prevent its dereferencing if NULL.
Take the opportunity of this patch to modify trace events from
QUIC_EV_CONN_SSLALERT to QUIC_EV_CONN_IDLE_TIMER.
Must be backported to 2.6 and 2.7.
The HTTP message must remain in BODY state during the analysis, to be able
to report an accurate termination state in logs. It is also important to know
the HTTP analysis is still in progress. Thus, when we are waiting for the
message payload, the message is no longer switched to DATA state. This was
used to avoid processing the "Expect: " header at each evaluation. But thanks
to the previous patch, it is no longer necessary.
This patch also fixes a bug in the lua filter api. Some functions must be
called during the message analysis and not during the payload forwarding. It
is not valid to try to manipulate headers during the forward stage because
headers are already forwarded. We rely on the message state to detect
errors. So the api was unusable if a "wait-for-body" action was used.
This patch should fix the issue #2093. It relies on the commit:
* MINOR: http-ana: Add a HTTP_MSGF flag to state the Expect header was checked
Both must be backported as far as 2.5.
HTTP_MSGF_EXPECT_CHECKED is now set on the request message to know the
"Expect: " header was already handled, if any. The flag is set from the
moment we try to handle the header to send a "100-continue" response,
whether it was found or not.
This way, when we are waiting for the request payload, thanks to this flag,
we only try to handle the "Expect: " header once. Before, it was performed
by changing the message state from BODY to DATA. But this has some side
effects and is not accurate. So, it is better to rely on a flag to do so.
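A simplified sketch of the intended check in the request analyzer (the helper
name below is hypothetical; only the flag comes from this patch):
    /* hedged sketch: handle "Expect: 100-continue" at most once per message */
    if (!(msg->flags & HTTP_MSGF_EXPECT_CHECKED)) {
        handle_expect_hdr(s, htx, msg);          /* hypothetical helper */
        msg->flags |= HTTP_MSGF_EXPECT_CHECKED;  /* never re-scan on later calls */
    }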
Now that event_hdl api is properly implemented in hlua, we may add the
per-server event subscription in addition to the global event
subscription.
Per-server subscription allows being notified of events related to a
single server. It is useful to track server UP/DOWN and DEL events.
It works exactly like core.event_sub() except that the subscription
will be performed within the server dedicated subscription list instead
of the global one.
The callback function will only be called for server events affecting
the server from which the subscription was performed.
Regarding the implementation, it is pretty trivial at this point, we add
more doc than code this time.
Usage examples have been added to the (lua) documentation.
Now that the event handler API is pretty mature, we can expose it in
the lua API.
Introducing the core.event_sub(<event_types>, <cb>) lua function that
takes an array of event types <event_types> as well as a callback
function <cb> as argument.
The function returns a subscription <sub> on success.
Subscription <sub> allows you to manage the subscription from anywhere
in the script.
To this day only the sub->unsub method is implemented.
The following event types are currently supported:
- "SERVER_ADD": when a server is added
- "SERVER_DEL": when a server is removed from haproxy
- "SERVER_DOWN": the server state goes from up to down
- "SERVER_UP": the server state goes from down to up
As for the <cb> function: it will be called when one of the registered
event types occur. The function will be called with 3 arguments:
cb(<event>,<data>,<sub>)
<event>: event type (string) that triggered the function.
(could be any of the types used in <event_types> when registering
the subscription)
<data>: data associated with the event (specific to each event family).
For "SERVER_" family events, server details such as server name/id/proxy
will be provided.
If the server still exists (not yet deleted), a reference to the live
server is provided to spare you from an additional lookup if you need
to have direct access to the server from lua.
<sub> refers to the subscription, in case you need to manage it from
within an event handler.
(It refers to the same subscription as the one returned from
core.event_sub())
Subscriptions are per-thread: the thread that will be handling the
event is the one who performed the subscription using
core.event_sub() function.
Each thread treats events sequentially. It means that if you have,
let's say, SERVER_UP then SERVER_DOWN in a short period of time, your
cb function will first be called with SERVER_UP, and once you're done
handling the event, your function will be called again with SERVER_DOWN.
This is to ensure event consistency when it comes to logging / triggering
logic from lua.
Your lua cb function may yield if needed, but you're advised to process
the event as fast as possible to prevent the event queue from growing.
To prevent abuses, if the event queue for the current subscription goes
over 100 unconsumed events, the subscription will pause itself
automatically for as long as it takes for your handler to catch up.
This would lead to events being missed, so a warning will be emitted in
the logs to inform you about that. This is not something you want to let
happen too often, it may indicate that you subscribed to an event that
is occurring too frequently and/or that your callback function is too
slow to keep up the pace, and you should review it.
If you want to do some parallel processing because your callback
functions are slow: you might want to create subtasks from lua using
core.register_task() from within your callback function to perform the
heavy job in a dedicated task and allow remaining events to be processed
more quickly.
Please check the lua documentation for more information.
Adding alternative findserver() functions to be able to perform a
unique match based on name or puid, leveraging the revision id (rid)
to make sure the function won't match a new server reusing the
same name or puid as the "potentially deleted" server we were initially
looking for.
For example, if you were in the position of finding a server based on
a given name provided to you by a different context:
Since dynamic servers were implemented, between the time the name was
picked and the time you will perform the findserver() call, some dynamic
server deletions/additions could have been performed in the meantime.
In such cases, findserver() could return a new server that re-uses the
name of a previously deleted server. Depending on your needs, it could
be perfectly fine, but there are some cases where you want to lookup
the original server that was provided to you (if it still exists).
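A minimal standalone sketch of the kind of matching involved (the structs are
reduced stand-ins and the function name is illustrative, not the exact helper
added by this patch):
    #include <string.h>

    /* minimal stand-ins for the relevant haproxy fields (illustrative only) */
    struct server { char *id; unsigned int rid; struct server *next; };
    struct proxy  { struct server *srv; };

    /* match on name AND revision id so that a new server re-using the name of
     * a deleted one is not returned by mistake */
    static struct server *find_server_by_name_rid(struct proxy *px,
                                                  const char *name,
                                                  unsigned int rid)
    {
        for (struct server *srv = px->srv; srv; srv = srv->next)
            if (strcmp(srv->id, name) == 0 && srv->rid == rid)
                return srv;
        return NULL;
    }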
While working on event handling from lua, the need for a pause/resume
function to temporarily disable a subscription was raised.
We solve this by introducing the EHDL_SUB_F_PAUSED flag for
subscriptions.
The flag is set via _pause() and cleared via _resume(), and it is
checked prior to notifying the subscription in publish function.
Pause and resume functions are also available via lookups for
identified subscriptions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
Use event_hdl_async_equeue_size() in advanced async task handler to
get the near real-time event queue size.
By near real-time, you should understand that the queue size is not
updated during element insertion/removal, but shortly before insertion
and shortly after removal, so the size should reflect the approximate
queue size at a given time but should definitely not be used as a
unique source of truth.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
Advanced async mode (EVENT_HDL_ASYNC_TASK) provided full support for
custom tasklet registration.
Due to the similarities between tasks and tasklets, it may be useful
to use the advanced mode with an existing task (not a tasklet).
While the API did not explicitly disallow this usage, things would
go wrong if we tried to wake up a task using tasklet_wakeup() to notify
the task about new events.
To make the API support both custom tasks and tasklets, we use the
TASK_IS_TASKLET() macro to call the proper waking function depending
on the task's type:
- For tasklets: we use tasklet_wakeup()
- For tasks: we use task_wakeup()
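The dispatch boils down to something like the following sketch (based on the
macros and functions named above; the exact event_hdl call site is not
reproduced):
    /* hedged sketch: wake the consumer with the function matching its type */
    if (TASK_IS_TASKLET(task))
        tasklet_wakeup((struct tasklet *)task);
    else
        task_wakeup(task, TASK_WOKEN_OTHER);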
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
In _event_hdl_publish(), when publishing an event to async handler(s),
async_data is allocated only once and then relies on a refcount
logic to reuse the same data block for multiple async event handlers.
(this saves a significant amount of memory)
Because the refcount is first set to 0, there is a small race where
the consumers could consume async data (async data refcount reaching 0)
before publishing is actually over.
The consequence is that async data may be freed by one of the consumers
while we still rely on it within _event_hdl_publish().
This was discovered by chance when stress-testing the API with multiple
async handlers registered to the same event: some of the handlers were
notified about a new event for which the event data was already freed,
resulting in invalid reads and/or segfaults.
To fix this, we first set the refcount to 1, assuming that the
publish function relies on async_data until the publish is over.
At the end of the publish, the reference to the async data is dropped.
This way, async_data is either freed by _event_hdl_publish() itself
or by one of the consumers, depending on who is the last one relying
on it.
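A standalone sketch of the ordering fix (simplified to plain C11 atomics
instead of HAProxy's atomic wrappers, and reduced to the refcount handling
only):
    #include <stdatomic.h>
    #include <stdlib.h>

    struct async_data { atomic_uint refcount; /* ... event payload ... */ };

    static void drop_ref(struct async_data *d)
    {
        if (atomic_fetch_sub(&d->refcount, 1) == 1)
            free(d);                      /* the last user frees the block */
    }

    static void publish(struct async_data *d, int nb_consumers)
    {
        /* hold our own reference while distributing the data, so that a fast
         * consumer cannot free it under our feet */
        atomic_store(&d->refcount, 1);

        for (int i = 0; i < nb_consumers; i++) {
            atomic_fetch_add(&d->refcount, 1);
            /* ... enqueue d to consumer i, which calls drop_ref() when done ... */
        }

        drop_ref(d);                      /* publishing is over, drop our ref */
    }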
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
soft-stop was not explicitly handled in event_hdl API.
Because of this, event_hdl was causing some leaks on deinit paths.
Moreover, a task responsible for handling events could require some
additional cleanups (ie: advanced async task), and as the task was not
protected against abort when soft-stopping, such cleanup could not be
performed unless the task itself implements the required protections,
which is not optimal.
Consider this new approach:
'jobs' global variable is incremented whenever an async subscription is
created to prevent the related task from being aborted before the task
acknowledges the final END event.
Once the END event is acknowledged and freed by the task, the 'jobs'
variable is decremented, and the deinit process may continue (including
the abortion of remaining tasks not guarded by the 'jobs' variable).
To do this, a new global mt_list is required: known_event_hdl_sub_list
This list tracks the known (initialized) subscription lists within the
process.
sub_lists are automatically added to the "known" list when calling
event_hdl_sub_list_init(), and are removed from the list with
event_hdl_sub_list_destroy().
This allows us to implement a global thread-safe event_hdl deinit()
function that is automatically called on soft-stop thanks to signal(0).
When event_hdl deinit() is initiated, we simply iterate against the known
subscription lists to destroy them.
event_hdl_subscribe_ptr() was slightly modified to make sure that a sub_list
may not accept new subscriptions once it is destroyed (removed from the
known list)
This can occur between the time the soft-stop is initiated (signal(0)) and
haproxy actually enters in the deinit() function (once tasks are either
finished or aborted and other threads already joined).
It is safe to destroy() the subscription list multiple times as long
as the pointer is still valid (ie: first on soft-stop when handling
the '0' signal, then from regular deinit() path): the function does
nothing if the subscription list is already removed.
We partially reverted "BUG/MINOR: event_hdl: make event_hdl_subscribe thread-safe"
since we can use parent mt_list locking instead of a dedicated lock to make
the check against duplicate subscription IDs.
(insert_lock is not useful anymore)
The check in itself is not changed, only the locking method.
sizeof(event_hdl_sub_list) slightly increases: from 24 bytes to 32 bytes due
to the additional mt_list struct within it.
With that said, having a thread-safe list to store known subscription lists
is a good thing: it could help to implement additional management
logic for subscription lists and could be useful to add some stats or
debugging tools in the future.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
event_hdl_sub_list_init() and event_hdl_sub_list_destroy() don't expect
to be called with a NULL argument (to use global subscription list
implicitly), simply because the global subscription list init and
destroy is internally managed.
Adding BUG_ON() to detect such invalid usages, and updating some comments
to prevent confusion around these functions.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
List insertion in event_hdl_subscribe() was not thread-safe when dealing
with unique identifiers. Indeed, in this case the list insertion is
conditional (we check for a duplicate, then we insert). And while we're
using mt lists for this, the whole operation is not atomic: there is a
race between the check and the insertion.
This could lead to the same ID being registered multiple times with
concurrent calls to event_hdl_subscribe() on the same ID.
To fix this, we add a dedicated 'insert_lock' lock in the subscription
list struct. The lock's cost is nearly 0 since it is only used when
registering identified subscriptions and the lock window is very short:
we only guard the duplicate check and the list insertion to make the
conditional insertion "atomic" within a given subscription list.
This is the only place where we need the lock: as soon as the item is
properly inserted we're out of trouble because all other operations on
the list are already thread-safe thanks to mt lists.
A new lock hint is introduced: LOCK_EHDL which is dedicated to event_hdl
The patch may seem quite large since we had to rework the logic around
the subscribe function and switch from simple mt_list to a dedicated
struct wrapping both the mt_list and the insert_lock for the
event_hdl_sub_list type.
(sizeof(event_hdl_sub_list) is now 24 instead of 16)
However, all the changes are internal: we don't break the API.
If 68e692da0 ("MINOR: event_hdl: add event handler base api")
is being backported, then this commit should be backported with it.
core.register_task(function) may now take up to 4 additional arguments
that will be passed as-is to the task function.
This could be convenient to spawn sub-tasks from existing functions
supporting core.register_task() without the need to use global
variables to pass some context to the newly created task function.
The new prototype is:
core.register_task(function[, arg1[, arg2[, ...[, arg4]]]])
Implementation remains backward-compatible with existing scripts.
Server revision ID was recently added to haproxy with 61e3894
("MINOR: server: add srv->rid (revision id) value")
Let's add it to the hlua server class.
Main lua lock is used at various places in the code.
Most of the time it is used from unprotected lua environments,
in which case the locking is mandatory.
But there are some cases where the lock is attempted from protected
lua environments, meaning that the lock is already owned by the current
thread. Thus the new locking attempt should be skipped to prevent any
deadlock from occurring.
To address this, "already_safe" lock hint was implemented in
hlua_ctx_init() function with commit bf90ce1
("BUG/MEDIUM: lua: dead lock when Lua tasks are trigerred")
But this approach is not very safe, for 2 reasons:
First reason is that there are still some code paths that could lead
to deadlocks.
For instance, in register_task(), hlua_ctx_init() is called with
already_safe set to 1 to prevent a deadlock from occurring.
But in case of task init failure, hlua_ctx_destroy() will be called
from the same environment (protected environment), and hlua_ctx_destroy()
does not offer the already_safe lock hint, resulting in a deadlock.
Second reason is that the already_safe hint is used to completely skip
the SET_LJMP macros (which manipulate the lock internally), resulting
in some logic in the function being unprotected from lua aborts in
case of unexpected errors when manipulating the lua stack (the lock
does not protect against longjmps).
Instead of leaving the locking responsibility to the caller, which is
quite error-prone since we must find out ourselves whether or not we are in
a protected environment (and is not robust against code re-use),
we move the deadlock protection logic directly in hlua_lock() function.
Thanks to a thread-local lock hint, we can easily guess if the current
thread already owns the main lua lock, in which case the locking attempt
is skipped. The thread-local lock hint is implemented as a counter so
that the lock is properly dropped when the counter reaches 0.
(to match actual lock() and unlock() calls)
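A standalone sketch of the counter-based protection (plain pthreads instead
of HAProxy's locking primitives; names are illustrative):
    #include <pthread.h>

    static pthread_mutex_t hlua_global_lock = PTHREAD_MUTEX_INITIALIZER;
    static __thread unsigned int hlua_lock_depth;   /* thread-local lock hint */

    static void hlua_lock(void)
    {
        /* only take the mutex on the outermost call from this thread */
        if (hlua_lock_depth++ == 0)
            pthread_mutex_lock(&hlua_global_lock);
    }

    static void hlua_unlock(void)
    {
        /* only release it when the matching outermost unlock is reached */
        if (--hlua_lock_depth == 0)
            pthread_mutex_unlock(&hlua_global_lock);
    }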
This commit depends on "MINOR: hlua: simplify lua locking"
It may be backported to all stable versions.
[prior to 2.5 lua filter API did not exist, filter-related parts
should be skipped]
The check on lua state==0 to know whether locking is required or not can
be performed in a locking wrapper to simplify things a bit and prevent
implementation errors.
Locking from hlua context should now be performed via hlua_lock(L) and
unlocking via hlua_unlock(L)
Using hlua_pushref() everywhere temporary lua objects are involved.
(ie: hlua_checkfunction(), hlua_checktable...)
Those references are expected to be cleared using hlua_unref() when
they are no longer used.
Using hlua_ref() everywhere temporary lua objects are involved.
Those references are expected to be cleared using hlua_unref()
when they are no longer used.
Several error paths were leaking function or table references.
(Obtained through hlua_checkfunction() and hlua_checktable() functions)
Now we properly release the references thanks to hlua_unref() in
such cases.
This commit depends on "MINOR: hlua: add simple hlua reference handling API"
This could be backported to all stable versions although it is not
mandatory as such leaks only occur on rare error/warn paths.
[prior to 2.5 lua filter API did not exist, the hlua_register_filter()
part should be skipped]
hlua init function references were not released during
hlua_post_init_state().
Fortunately, this function is only used during startup so the resulting
leak is not a big deal.
Since each init lua function runs precisely once, it is safe to release
the ref as soon as the function is restored on the stack.
This could be backported to all stable versions.
Please note that this commit depends on "MINOR: hlua: add simple hlua reference handling API"
In core.register_task(): we take a reference to the function passed as
argument in order to push it in the new coroutine substack.
However, once it is pushed onto the substack, the reference is no longer
useful and should be cleared.
Currently, this is not the case in hlua_register_task().
Explicitly dropping the reference once the function is pushed to the
coroutine's stack to prevent any reference leak (which could contribute
to resource shortage)
This may be backported to all stable versions.
Please note that this commit depends on "MINOR: hlua: add simple hlua reference handling API"
hlua_checktable() and hlua_checkfunction() both return the raw
value of luaL_ref() function call.
As luaL_ref() returns a signed int, both functions should return a signed
int as well to prevent any misuse of the returned reference value.
We're doing this in an attempt to simplify temporary lua objects
references handling.
Adding the new hlua_unref() function to release lua object references
created using luaL_ref(, LUA_REGISTRYINDEX)
(ie: hlua_checkfunction() and hlua_checktable())
Failure to release an unused object reference prevents the reference index
from being re-used and prevents the referenced resource from being garbage
collected.
Adding hlua_pushref(L, ref) to replace
lua_rawgeti(L, LUA_REGISTRYINDEX, ref)
Adding hlua_ref(L) to replace luaL_ref(L, LUA_REGISTRYINDEX)
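These three helpers are thin wrappers around the Lua registry API; a sketch
of what they boil down to (the real ones are part of the hlua code):
    #include <lua.h>
    #include <lauxlib.h>

    /* take a reference on the value at the top of the stack (and pop it) */
    static int hlua_ref(lua_State *L)
    {
        return luaL_ref(L, LUA_REGISTRYINDEX);
    }

    /* push the value referenced by <ref> back onto the stack */
    static void hlua_pushref(lua_State *L, int ref)
    {
        lua_rawgeti(L, LUA_REGISTRYINDEX, ref);
    }

    /* release the reference so the value can be garbage collected again */
    static void hlua_unref(lua_State *L, int ref)
    {
        luaL_unref(L, LUA_REGISTRYINDEX, ref);
    }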
The comment for the hlua_ctx_destroy() function states that the "lua"
struct is not freed.
This is not true anymore since 2c8b54e7 ("MEDIUM: lua: remove Lua struct
from session, and allocate it with memory pools")
Updating the function comment to properly report the actual behavior.
This could be backported to all stable versions with 2c8b54e7
("MEDIUM: lua: remove Lua struct from session, and allocate it with memory pools")
Since ("MINOR: hlua_fcn: alternative to old proxy and server attributes"):
- s->name(), s->puid() are superseded by s->get_name() and s->get_puid()
- px->name(), px->uuid() are superseded by px->get_name() and
px->get_uuid()
And considering this is now the proper way to retrieve proxy name/uuid
and server name/puid from lua:
We're now removing such legacy attributes, but for backward-compatibility
purposes we will be emulating them and warning the user for some time
before completely dropping their support.
To do this, we first remove old legacy code.
Then we move server and proxy methods out of the metatable to allow
direct elements access without systematically involving the "__index"
metamethod.
This allows us to involve the "__index" metamethod only when the requested
key is missing from the table.
Then we define relevant hlua_proxy_index and hlua_server_index functions
that will be used as the "__index" metamethod to respectively handle
"name, uuid" (proxy) or "name, puid" (server) keys, in which case we
warn the user about the need to use the new getter function instead of the
legacy attribute (to prepare for the potential upcoming removal), and we
call the getter function to return the value as if the getter function
was directly called from the script.
Note: Using the legacy variables instead of the getter functions results
in a slight overhead due to the "__index" metamethod indirection, thus
it is recommended to switch to the getter functions right away.
With this commit we're also adding a deprecation notice about legacy
attributes.
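A reduced sketch of such an "__index" fallback for the server class, using
only the public Lua C API (the deprecation warning and the exact getter
wiring in the hlua code are simplified):
    #include <lua.h>
    #include <lauxlib.h>
    #include <string.h>

    /* hedged sketch: emulate the legacy "name" attribute by forwarding to the
     * get_name() method stored directly in the object table */
    static int hlua_server_index(lua_State *L)
    {
        const char *key = luaL_checkstring(L, 2);   /* (table, key) arguments */

        if (strcmp(key, "name") == 0) {
            /* a deprecation warning would be emitted here */
            lua_getfield(L, 1, "get_name");   /* fetch the getter from the object */
            lua_pushvalue(L, 1);              /* pass the object as 'self' */
            lua_call(L, 1, 1);                /* call get_name(self) */
            return 1;                         /* return its result */
        }
        lua_pushnil(L);                       /* unknown key: behave as a miss */
        return 1;
    }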
This patch proposes to enumerate servers using the internal HAProxy list.
Also, remove the flag SRV_F_NON_PURGEABLE which makes the server
non-purgeable each time Lua uses the server.
Removing reg-tests/cli_delete_server_lua.vtc since this test is no
longer relevant (we don't set the SRV_F_NON_PURGEABLE flag anymore)
and we already have a more generic test:
reg-tests/server/cli_delete_server.vtc
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
This patch adds new lua methods:
- "Proxy.get_uuid()"
- "Proxy.get_name()"
- "Server.get_puid()"
- "Server.get_name()"
These methods will be equivalent to their old analogous Proxy.{uuid,name}
and Server.{puid,name} attributes, but this will be the new preferred
way to fetch such info as it duplicates memory only when necessary and
thus reduces the overall memory footprint of the lua Server/Proxy objects.
Legacy attributes (now superseded by the explicit getters) are expected
to be removed some day.
Co-authored-by: Aurelien DARRAGON <adarragon@haproxy.com>
When HAProxy is loaded with a lot of frontends/backends (tested with 300k),
it is slow to start and it uses a lot of memory just for indexing backends
in the lua tables.
This patch uses the internal frontend/backend index of HAProxy in place of
the lua tables.
HAProxy startup is now quicker as each frontend/backend object is created
on demand and not at init.
This comes with some cost: the execution of Lua will be a little bit
slower.
Two lua init functions seem to return something useful, but this
is not the case. The function "hlua_concat_init" seems to return
a failure status, but the function never fails. The function
"hlua_fcn_reg_core_fcn" seems to return a number of elements in
the stack, but it is not the case.
register_{init, converters, fetches, action, service, cli, filter} are
meant to run exclusively from body context according to the
documentation (unlike register_task which is designed to work from both
init and runtime contexts)
A quick code inspection confirms that only register_task implements
the required precautions to make it safe out of init context.
Trying to use those register_* functions from a runtime lua task will
lead to a program crash since they all assume that they are running from
the main lua context and with no concurrent runs:
core.register_task(function()
core.register_init(function()
end)
end)
When loaded from the config, the above example would segfault.
To prevent this undefined behavior, we now report an explicit error if
the user tries to use such functions outside of init/body context.
This should be backported to all stable versions.
[prior to 2.5 lua filter API did not exist, the hlua_register_filter()
part should be skipped]
In hlua_process_task: when HLUA_E_ETMOUT was returned by
hlua_ctx_resume(), meaning that the lua task reached
tune.lua.task-timeout (default: none),
we logged "Lua task: unknown error." before stopping the task.
Now we properly handle HLUA_E_ETMOUT to report a meaningful error
message.
In function hlua_hook, a yieldk is performed when the function is yieldable.
But the following code in that function seems to assume that the yield
never returns, which is not the case!
Moreover, Lua documentation says that in this situation the yieldk call
must immediately be followed by a return.
This patch adds a return statement after the yieldk call.
It also adds some comments and removes a needless lua_sethook call.
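The documented pattern is roughly the following (a sketch of a count/line
debug hook, not the actual hlua_hook() body):
    #include <lua.h>

    static void my_hook(lua_State *L, lua_Debug *ar)
    {
        (void)ar;
        if (lua_isyieldable(L)) {
            /* yield, then return immediately: code placed after lua_yieldk()
             * must not assume the yield never comes back */
            lua_yieldk(L, 0, 0, NULL);
            return;
        }
        /* non-yieldable coroutine: fall through without yielding */
    }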
It could be backported to all stable versions, but it is not mandatory,
because even if it is undefined behavior this bug doesn't seem to
negatively affect lua 5.3/5.4 stacks.
The srv_drop() function is responsible for freeing the server when the
refcount reaches 0.
There is one exception: when global.mode has the MODE_STOPPING flag set,
srv_drop() will ignore the refcount and free the server on first
invocation.
This logic has been implemented with 13f2e2ce ("BUG/MINOR: server: do
not use refcount in free_server in stopping mode") and back then doing
so was not a problem since dynamic server API was just implemented and
srv_take() and srv_drop() were not widely used.
Now that the dynamic server API is starting to get more popular we cannot
afford to keep the current logic: some modules or lua scripts may hold
references to existing servers and also do their cleanup in deinit phases.
In this kind of situation, it would be easy to trigger double-frees
since every call to srv_drop() on a specific server will try to free it.
To fix this, we take a different approach and try to fix the issue at
the source: we now properly drop server references involved with
checks/agent_checks in deinit_srv_check() and deinit_srv_agent_check().
While this could theoretically be backported up to 2.6, it is not very
relevant for now since srv_drop() usage in older versions is very
limited and we're only starting to face the issue in mid 2.8
developments. (ie: lua core updates)
In srv_drop(), we only call the ssl->destroy_srv() method on
specific conditions.
But this has two downsides:
First, destroy_srv() is responsible for freeing data that may have been
allocated in prepare_srv(), but not exclusively: it also frees
ssl-related parameters allocated when parsing a server entry, such as
ca-file for instance.
So this is quite error-prone, we could easily miss a condition where
some data needs to be deallocated using destroy_srv() even if
prepare_srv() was not used (since prepare_srv() is also conditional),
thus resulting in memory leaks.
Moreover, depending on srv->proxy to guard the check is probably not
a good idea here, since srv_drop() could be called in late de-init paths
in which the related proxy could already be freed. srv_drop() should only
take care of freeing local server data without external logic.
Thankfully, destroy_srv() function performs the necessary checks to
ensure that a systematic call to the function won't result in invalid
reads or double frees.
No backport needed.
Proxies belonging to the cfg_log_forward proxy list are not cleaned up
in the haproxy deinit() function.
We add the missing cleanup directly in the main deinit() function since
no other specific function may be used for this.
This could be backported up to 2.4
When a ring section is configured, a new sink is created and forward_px
proxy may be allocated and assigned to the sink.
Such sink-related proxies are added to the sink_proxies_list and thus
don't belong to the main proxy list which is cleaned up in
haproxy deinit() function.
We don't have to manually clean up sink_proxies_list in the main deinit()
func:
sink API already provides the sink_deinit() function so we just add the
missing free_proxy(sink->forward_px) there.
This could be backported up to 2.4.
[in 2.4, commit b0281a49 ("MINOR: proxy: check if p is NULL in free_proxy()")
must be backported first]
In stats_dump_proxy_to_buffer() function, special care was taken when
dealing with servers dump.
Indeed, stats_dump_proxy_to_buffer() can be interrupted and resumed if
buffer space is not big enough to complete dump.
Thus, a reference is taken on the server being dumped in the hope that
the server will still be valid when the function resumes.
(to prevent the server from being freed in the meantime)
While this is now true thanks to:
- "BUG/MINOR: server/del: fix legacy srv->next pointer consistency"
We still have an issue: when resuming, the saved server reference is not
dropped.
This prevents the server from being freed when we no longer use it.
Moreover, as the saved server might now be deleted
(SRV_F_DELETED flag set), the current deleted server may still be dumped
in the stats and while this is not a bug, this could be misleading for
the user.
Let's add a px_st variable to detect if the stats_dump_proxy_to_buffer()
is being resumed at the STAT_PX_ST_SV stage: perform some housekeeping
to skip deleted servers and properly drop the reference on the saved
server.
This commit depends on:
- "MINOR: server: add SRV_F_DELETED flag"
- "BUG/MINOR: server/del: fix legacy srv->next pointer consistency"
This should be backported up to 2.6
We recently discovered a bug which affects dynamic server deletion:
When a server is deleted, it is removed from the "visible" server list.
But as we've seen in previous commit
("MINOR: server: add SRV_F_DELETED flag"), it can still be accessed by
someone who keeps a reference on it (waiting for the final srv_drop()).
Throughout this transient state, the server ptr is still valid (may be
dereferenced) and the flag SRV_F_DELETED is set.
However, as the server is not part of the server list anymore, we have
an issue: the srv->next pointer won't be updated anymore as the only place
where we perform such an update is in cli_parse_delete_server() by
iterating over the "visible" server list.
Because of this, we cannot guarantee that a server with the
SRV_F_DELETED flag has a valid 'next' ptr: 'next' could be pointing
to a fully removed (already freed) server.
This problem can be easily demonstrated with server dumping in
the stats:
server list dumping is performed in stats_dump_proxy_to_buffer()
The function can be interrupted and resumed later by design.
ie: output buffer is full: partial dump and finish the dump after
the flush
This is implemented by calling srv_take() on the server being dumped,
and only releasing it when we're done with it using srv_drop().
(drop can be delayed after function resume if buffer is full)
While the function design seems OK, it works with the assumption that
srv->next will still be valid after the function resumes, which is
not true. (especially if multiple servers are being removed in between
the 2 dumping attempts)
In practice, this did not cause any crash yet (at least this was not
reported so far), because server dumping is so fast that it is very
unlikely that multiple server deletions make their way between 2
dumping attempts in most setups. But still, this is a problem that we
need to address because some upcoming work might depend on this
assumption as well and for the moment it is not safe at all.
========================================================================
Here is a quick reproducer:
With this patch, we're creating a large deletion window of 3s as soon
as we reach a server named "t2" while iterating over the list.
This will give us plenty of time to perform multiple deletions before
the function is resumed.
| diff --git a/src/stats.c b/src/stats.c
| index 84a4f9b6e..15e49b4cd 100644
| --- a/src/stats.c
| +++ b/src/stats.c
| @@ -3189,11 +3189,24 @@ int stats_dump_proxy_to_buffer(struct stconn *sc, struct htx *htx,
| * Temporarily increment its refcount to prevent its
| * anticipated cleaning. Call free_server to release it.
| */
| + struct server *orig = ctx->obj2;
| for (; ctx->obj2 != NULL;
| ctx->obj2 = srv_drop(sv)) {
|
| sv = ctx->obj2;
| + printf("sv = %s\n", sv->id);
| srv_take(sv);
| + if (!strcmp("t2", sv->id) && orig == px->srv) {
| + printf("deletion window: 3s\n");
| + thread_idle_now();
| + thread_harmless_now();
| + sleep(3);
| + thread_harmless_end();
| +
| + thread_idle_end();
| +
| + goto full; /* simulate full buffer */
| + }
|
| if (htx) {
| if (htx_almost_full(htx))
| @@ -4353,6 +4366,7 @@ static void http_stats_io_handler(struct appctx *appctx)
| struct channel *res = sc_ic(sc);
| struct htx *req_htx, *res_htx;
|
| + printf("http dump\n");
| /* only proxy stats are available via http */
| ctx->domain = STATS_DOMAIN_PROXY;
|
Ok, we're ready, now we start haproxy with the following conf:
global
stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners thread 1-1
nbthread 2
frontend stats
mode http
bind *:8081 thread 2-2
stats enable
stats uri /
backend farm
server t1 127.0.0.1:1899 disabled
server t2 127.0.0.1:18999 disabled
server t3 127.0.0.1:18998 disabled
server t4 127.0.0.1:18997 disabled
And finally, we execute the following script:
curl localhost:8081/stats&
sleep .2
echo "del server farm/t2" | nc -U /tmp/ha.sock
echo "del server farm/t3" | nc -U /tmp/ha.sock
This should be enough to reveal the issue, I easily manage to
consistently crash haproxy with the following reproducer:
http dump
sv = t1
http dump
sv = t1
sv = t2
deletion window = 3s
[NOTICE] (2940566) : Server deleted.
[NOTICE] (2940566) : Server deleted.
http dump
sv = t2
sv = �����U
[1] 2940566 segmentation fault (core dumped) ./haproxy -f ttt.conf
========================================================================
To fix this, we add a prev_deleted mt_list in the server struct.
For a given "visible" server, this list will contain the pending
"deleted" servers references that point to it using their 'next' ptr.
This way, whenever this "visible" server is going to be deleted via
cli_parse_delete_server() it will check for servers in its
'prev_deleted' list and update their 'next' pointer so that they no
longer point to it, and then it will push them in its
'next->prev_deleted' list to transfer the update responsibility to the
next 'visible' server (if next != NULL).
Then, following the same logic, the server about to be removed in
cli_parse_delete_server() will push itself as well into its
'next->prev_deleted' list (if next != NULL) so that it may still use its
'next' ptr for the time it is in transient removal state.
In srv_drop(), right before the server is finally freed, we make sure
to remove it from the 'next->prev_deleted' list so that 'next' won't
try to perform the pointers update for this server anymore.
This has to be done atomically to prevent 'next' srv from accessing a
purged server.
As a result:
for a valid server, either deleted or not, the 'next' ptr will always
point to a non-deleted (ie: visible) server.
With the proposed fix, and several removal combinations (including
unordered cli_parse_delete_server() and srv_drop() calls), I cannot
reproduce the crash anymore.
Example tricky removal sequence that is now properly handled:
sv list: t1,t2,t3,t4,t5,t6
ops:
take(t2)
del(t4)
del(t3)
del(t5)
drop(t3)
drop(t4)
drop(t5)
drop(t2)
Set the SRV_F_DELETED flag when a server is removed from the cli.
When removing a server from the cli (in cli_parse_delete_server()),
we update the "visible" server list so that the removed server is no
longer part of the list.
However, despite the server being removed from "visible" server list,
one could still access the server data from a valid ptr (ie: srv_take())
The deleted flag helps detect when a server is in a transient removal
state: that is, removed from the list, thus not visible, but not yet
purged from memory.
The SE_FL_EOS flag must never be set on the SE descriptor without SE_FL_EOI
or SE_FL_ERROR. When a mux or an applet reports an end of stream, it must be
able to state if it is the end of input too or if it is an error.
Because all this part was recently refactored, especially the applet part,
it is a bit sensitive. Thus a BUG_ON_HOT() is used and not a BUG_ON().
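The invariant can be expressed as a single assertion on the SE descriptor
flags; a sketch of the kind of check that was added (the exact location in
the stconn code is not shown):
    /* hedged sketch: EOS alone is invalid, it must come with EOI or ERROR */
    BUG_ON_HOT(se_fl_test(sd, SE_FL_EOS) &&
               !se_fl_test(sd, SE_FL_EOI | SE_FL_ERROR));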
The purpose of this patch is only a one-to-one replacement, as far as
possible.
CF_SHUTR(_NOW) and CF_SHUTW(_NOW) flags are now carried by the
stream-connector. The CF_ prefix is replaced by the SC_FL_ one. Of course, it
is not so simple because in many places we were testing if a channel was shut
for reads and writes at the same time. To do the same, shut for reads must be
tested on one side on the SC and shut for writes on the other side on the
opposite SC. Special care was taken with process_stream(). The flags of the
SCs must be saved to be able to detect changes, just like for the channels.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. Here, the applet is a bit
refactored to handle SE descriptor EOS, EOI and ERROR flags
The state of the opposite SC is already tested to wait for the connection to
be established before sending messages. So, there is no reason to test it again
before looping on the ring buffer.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. We must just be sure to consume
request data when we are waiting for the applet to be released.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream.
Here, the refactoring only reports errors by setting SE_FL_ERROR flag.
There are 3 kinds of applet in lua: The co-sockets, the TCP services and the
HTTP services. The three are refactored to use the SE descriptor instead of
the channel to report error and end-of-stream.
Just like for other applets, we now use the SE descriptor instead of the
channel to report error and end-of-stream. We must just be sure to consume
request data when we are waiting for the applet to be released.
This patch is a bit different from the others because message handling is
dispatched across several functions. But the idea is the same.
It is now the dns applet's turn to be refactored to use the SE descriptor
instead of the channel to report error and end-of-stream. We must just be sure
to consume request data when we are waiting for the applet to be released.
The state of the opposite SC is already tested to wait for the connection to
be established before sending requests. So, there is no reason to test it again
before looping on the ring buffer.
It is the same kind of change as for the cache applet. The idea is to use the
SE desc instead of the channel or the SC to report end-of-input, end-of-stream
and errors.
Truncated commands are now reported on error. Other changes are the same as
for the cache applet. We now set the SE_FL_EOS flag instead of calling
cf_shutr() and calls to cf_shutw() are removed.
We now try, as far as possible, to rely on the SE descriptor to detect end
of processing. Idea is to no longer rely on the channel or the SC to do so.
First, we now set SE_FL_EOS instead of calling cf_shutr() to report the
end of the stream. It happens when the response is fully sent (SE_FL_EOI is
already set in this case) or when an error is reported. In this last case,
SE_FL_ERROR is also set.
Thanks to this change, it is now possible to detect the applet must only
consume the request, waiting for the upper layer to release it. So, if
SE_FL_EOS or SE_FL_ERROR are set, it means the response was fully
handled. And if SE_FL_SHR or SE_FL_SHW are set, it means the applet was
released by upper layer and is waiting to be freed.
Just like for end of input, the end of stream reported by the endpoint
(SE_FL_EOS flag) is now handled in sc_applet_process(). The idea is to have
applets acting as muxes by reporting events through the SE descriptor, as
far as possible.
Thanks to the previous patch, it is now possible for applets to not set the
CF_EOI flag on the channels. On this point, the applets get closer to the
muxes.
The end of input reported by the endpoint (SE_FL_EOI flag), is now handled
in sc_applet_process(). This function is always called after an applet was
called. So, the applets can now only report EOI on the SE descriptor and
have no reason to update the channel too.
EOS is now acknowledged at the end of sc_conn_recv(), even if an error was
encountered. There is no reason not to do so, especially because, if it is not
performed here, it will be acknowledged in sc_conn_process().
Note, it is still performed in sc_conn_process() because this function is
also the .wake callback function and can be directly called from the lower
layer.
On truncated message, a parsing error is still reported. But an error on the
SE descriptor is also reported. This will avoid any bugs in the future. We
are now sure the SC is able to detect the error, independently of the HTTP
analyzers.
In h1_rcv_pipe(), only the end of stream was reported when a read0 was
detected. However, it is also important to report the end of input or an
error, depending on the message state. This patch does not fix any real
issue for now, but some others, specific to the 2.8, rely on it.
No backport needed.
In the PT multiplexer, the end of stream is also the end of input. Thus
we must report EOI to the stream-endpoint descriptor when the EOS is
reported. For now, it is a bit useless but it will be important to
distinguish a shutdown from an error or an abort.
To be sure to not report an EOI on an error, the errors are now handled
first.
In sc_conn_recv(), if the EOS is reported by the endpoint, it will always be
acknowledged by the SC and a read0 will be performed on the input
channel. Thus there is no reason to still test it at the beginning of the
function because there is already a test on CF_SHUTR.
When a response is consumed, the result of co_getblk() is never checked. It
seems OK because the amount of output data is always checked first. But there
is an issue when we try to get the first 2 bytes to read the message length.
If there is only one byte followed by a shutdown, the applet ignores the
shutdown and loops till the timeout to get more data.
So to avoid any issue and improve shutdown detection, the co_getblk() return
value is always tested. In addition, if there is not enough data, the applet
explicitly asks for more data by calling applet_need_more_data().
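A simplified sketch of the consuming side after this change (labels and the
surrounding DNS applet context are omitted and only illustrative):
    /* hedged sketch: read the 2-byte message length from the output channel */
    ret = co_getblk(sc_oc(sc), (char *)&msg_len, 2, 0);
    if (ret <= 0) {
        if (ret == -1)                    /* empty channel and shutdown received */
            goto close;                   /* stop waiting, the peer is gone */
        applet_need_more_data(appctx);    /* not enough data yet, ask for more */
        goto out;
    }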
This patch relies on the previous one:
* BUG/MEDIUM: channel: Improve reports for shut in co_getblk()
Both should be backported as far as 2.4. On 2.5 and 2.4,
applet_need_more_data() must be replaced by si_rx_endp_more().
When co_getblk() is called with a length and an offset of 0, a shutdown is
never reported. It may be an issue when the function is called to retrieve
all available output data, while there is no output data at all. And it
seems pretty annoying to handle this case in the caller.
Thus, now, in co_getblk(), -1 is returned when the channel is empty and a
shutdown was received.
There is no real reason to backport this patch alone. However, another fix
will rely on it.
There is a bug in the way the channel flags are checked to set the clientfin
or serverfin timeout. Indeed, to set the clientfin timeout, the request channel
must be shut for reads (CF_SHUTR) or the response channel must be shut for
writes (CF_SHUTW). Conversely, the serverfin timeout must be set when
the request channel is shut for writes (CF_SHUTW) or the response channel is
shut for reads (CF_SHUTR).
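A sketch of the corrected pairing (not the actual process_stream() code; the
timeout fields are the standard proxy settings):
    /* hedged sketch: the client-fin timeout applies when the client side is
     * half-closed, the server-fin timeout when the server side is */
    if ((req->flags & CF_SHUTR) || (res->flags & CF_SHUTW))
        exp = tick_add_ifset(now_ms, sess->fe->timeout.clientfin);
    else if ((req->flags & CF_SHUTW) || (res->flags & CF_SHUTR))
        exp = tick_add_ifset(now_ms, s->be->timeout.serverfin);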
It is a 2.8-dev specific issue. No backport needed.
In the commit b08c5259e ("MINOR: stconn: Always report READ/WRITE event on
shutr/shutw"), a return statement was erroneously removed from
sc_app_shutr(). As a consequence, the CF_SHUTR flag was never set. Fortunately,
it is the default .shutr callback function. Thus when a connection or an
applet is attached to the SC, another callback is used to perform a
shutdown for reads.
It is a 2.8-dev specific issue. No backport needed.
It is not possible to successfully match an empty response. However using
regex, it should be possible to reject response with any content. For
instance:
tcp-check expect !rstring ".+"
It may seem a bit strange to do that, but it is possible, it is a valid
config. So it must work. Thanks to this patch, it is now really supported.
This patch may be backported as far as 2.2. But only if someone asks for it.
As timestamps based on now_ms values are used to compute the probing timeout,
they may wrap. So, use the ticks API to compare them.
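In practice this means comparing with the ticks helpers instead of plain
arithmetic; a small sketch (variable names are illustrative):
    /* hedged sketch: wrapping-safe comparison of now_ms based timestamps */
    unsigned int probe_exp = tick_add(now_ms, probe_delay);

    if (tick_is_expired(probe_exp, now_ms)) {
        /* the probing timeout has elapsed, even across a now_ms wrap */
    }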
Must be backported to 2.7 and 2.6.
Since this commit, it is ->idle_expire of the quic_conn struct which must
be taken into account to display the idle timer task expiration value:
MEDIUM: quic: Ack delay implementation
Furthermore, this value was always zero until now_ms wrapped (20s after
the start time) due to this commit:
MEDIUM: clock: force internal time to wrap early after boot
Do not rely on the value of now_ms compared to ->idle_expire to display the
difference but use tick_remain() to compute it.
Must be backported to 2.7 where "show quic" has already been backported.
This bug arrived with this commit:
MEDIUM: quic: Ack delay implementation
It is possible that the idle timer task was already in the run queue when its
->expire field was updated by a call to qc_idle_timer_do_rearm(). To prevent this
task from running in this condition, one must check its ->expire field value
with this condition to run the task if its timer has really expired:
!tick_is_expired(t->expire, now_ms)
Furthermore, as this task may be directly woken up with a call to
task_wakeup(), for instance by qc_kill_conn() to kill the connection, one must
check that this
task has really been woken up when it was in the wait queue and not by a direct
call to task_wakeup() thanks to this test:
(state & TASK_WOKEN_ANY) == TASK_WOKEN_TIMER
Again, when this condition is not fulfilled, the task must be run.
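Put together, the guard at the top of the task handler looks roughly like this
sketch (not the exact qc_idle_timer_task() code):
    /* hedged sketch: only skip the processing when the wakeup came from the
     * timer AND the timer was rearmed in the meantime */
    if ((state & TASK_WOKEN_ANY) == TASK_WOKEN_TIMER &&
        !tick_is_expired(t->expire, now_ms))
        return t;    /* nothing to do yet */
    /* otherwise (expired timer or explicit task_wakeup()), run the task */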
Must be backported where the commit mentioned above was backported.
As found in issue #2089, it's easy to mistakenly paste a colon in a
header name, or other chars (e.g. spaces) when quotes are in use, and
this causes all sort of trouble in field because such chars are rejected
by the peer.
Better try to detect these upfront. That's what we're doing here during
the parsing of the add-header/set-header/early-hint actions, where a
warning is emitted if a non-token character is found in a header name.
A special case is made for the colon at the beginning so that it remains
possible to place any future pseudo-headers that may appear. E.g:
[WARNING] (14388) : config : parsing [badchar.cfg:23] : header name 'X-Content-Type-Options:' contains forbidden character ':'.
This should be backported to 2.7, and ideally it should be turned to an
error in future versions.
As now_ms may be zero, these BUG_ON() could be triggered when its value has wrapped.
These calls to BUG_ON() may be removed because the values they were supposed to
check are safely handled by the ticks API.
Must be backported to 2.6 and 2.7.
This very old bug has been there since the first implementation of the newreno
congestion algorithm. It was a very bad idea to put a state variable
into the quic_cc_algo struct which only defines the congestion control algorithm used
by a QUIC listener, typically its type and its callbacks.
This bug could lead to crashes since BUG_ON() calls have been added to each algorithm
implementation. This was revealed by interop tests, but not very often as there were
rarely several connections running at the same time during these tests.
Fortunately, this was also reported by Tristan in GH #2095.
Move the congestion algorithm state to the correct structures which are private
to a connection (see cubic and nr structs).
Must be backported to 2.7 and 2.6.
When entering a recovery period, the algo state is set by quic_enter_recovery().
And that's it! These two lines should have been removed with this commit:
BUG/MINOR: quic: Wrong use of now_ms timestamps (cubic algo)
Take the opportunity of this patch to add a missing TRACE_LEAVE() call in
quic_cc_cubic_ca_cb().
Must be backported to 2.7 and 2.6.
This algorithm does nothing except initializing the congestion control window
to a fixed value. Very smart!
Modify the QUIC congestion control configuration parser to support this new
algorithm. The congestion control algorithm must be set as follows:
quic-cc-algo nocc-<cc window size (KB)>
For instance if "nocc-15" is provided as quic-cc-algo keyword value, this
will set a fixed window of 15KB.
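For illustration, a hedged configuration sketch (the address, certificate path and
ALPN are hypothetical; only the quic-cc-algo value comes from this patch):

    bind quic4@0.0.0.0:443 ssl crt /etc/haproxy/cert.pem alpn h3 quic-cc-algo nocc-15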
Depending on what we're debugging, some FDs can represent pollution in
the "show fd" output. Here we add a set of filters allowing to pick (or
exclude) any combination of listener, frontend conn, backend conn, pipes,
etc. "show fd l" will only list listening connections for example.
In the ->srtt field of the quic_loss struct, 8*srtt is stored so as not to have to
multiply/divide it when computing the RTT variance (at least). This is where there
was a bug in quic_loss_srtt_update(): each time ->srtt must be used, it must be
divided by 8 or right shifted by 3.
This bug had a very bad impact on networks with non-negligible packet loss.
Must be backported to 2.6 and 2.7.
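For illustration, a minimal sketch of the scaled SRTT handling described above
(illustrative names, not the actual quic_loss_srtt_update() code): since ->srtt holds
8*srtt, the classical srtt = 7/8*srtt + 1/8*rtt_sample update becomes an add/shift,
and any reader of ->srtt must shift it right by 3.

    ql->srtt += rtt_sample - (ql->srtt >> 3); /* update 8*srtt in place */
    srtt      = ql->srtt >> 3;                /* real srtt for other uses */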
Reuse the idle timeout task to delay the acknowledgments. The time of the
idle timer expiration is from now on stored in ->idle_expire. The one to
trigger the acknowledgements is stored in ->ack_expire.
Add the QUIC_FL_CONN_ACK_TIMER_FIRED new connection flag to mark a connection
as having had its acknowledgement timer triggered.
Modify qc_may_build_pkt() to prevent the sending of "ack only" packets and
allow the connection to send packets when the ack timer has fired.
It is possible that acks are sent before the ack timer has triggered. In
this case it is cancelled only if ACK frames are really sent.
The idle timer expiration must be set again when the ack timer has been
triggered or when it is cancelled.
Must be backported to 2.7.
Dump variables displayed by TRACE_ENTER() or TRACE_LEAVE() by calls to TRACE_PROTO().
No more variables are displayed by the two former macros. From now on, this information
is accessible from the proto level.
Add new calls to TRACE_PROTO() at important locations in relation with the QUIC transport
protocol.
When relevant, try to prefix such traces with TX or RX keyword to identify the
concerned subpart (transmission or reception) of the protocol.
Must be backported to 2.7.
This callback was left as not implemented. It should at least display
the algorithm state, the congestion control window, the slow start threshold
and the time of the current recovery period. Should be helpful to debug.
Must be backported to 2.7.
This bug was revealed by handshakeloss interop tests (often with quiceh) where one
could see haproxy receive an Initial packet without a TLS ClientHello message (only a
padded PING frame). In this case, as ->max_idle_timeout was not initialized, the
connection was closed about three seconds later, and haproxy opened a new one with
a new source connection ID upon receipt of the next client Initial packet. As the
interop runner counts the number of source connection IDs used by the server to check
there were exactly 50 such IDs used by the server, it considered the test as failed.
So, the ->max_idle_timeout of the connection must be at least initialized
to the local "max_idle_timeout" transport parameter value to avoid such
a situation (closing connections too soon) until it is "negotiated" with the
client when receiving its TLS ClientHello message.
Must be backported to 2.7 and 2.6.
This patch is similar to the one for cubic algorithm:
"BUG/MINOR: quic: Wrong use of timestamps with now_ms variable (cubic algo)"
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore, to make the newreno congestion control algorithm more readable and easy
to maintain, add the new quic_cc_cubic_rp_cb() callback for the "in recovery period"
state (QUIC_CC_ST_RP).
Must be backported to 2.7 and 2.6.
Add ->srtt, ->rtt_var, ->rtt_min and ->pto_count values from ->path->loss
struct to "show quic". Same thing for ->cwnd from ->path struct.
Also take the opportunity of this patch to dump the packet number
space information directly from ->pktns[] array in place of ->els[]
array. Indeed, ->els[QUIC_TLS_ENC_LEVEL_EARLY_DATA] and ->els[QUIC_TLS_ENC_LEVEL_APP]
have the same packet number space.
Must be backported to 2.7 where the "show quic" implementation has already been
backported.
As now_ms may wrap, one must use the ticks API to protect the cubic congestion
control algorithm implementation from side effects due to this.
Furthermore, to make the cubic congestion control algorithm more readable and easy
to maintain, adding a new state ("in recovery period", the new QUIC_CC_ST_RP enum
value) helps in reaching this goal. Implement quic_cc_cubic_rp_cb() which is the
callback for this new state.
Must be backported to 2.7 and 2.6.
If a bundle is used in a crt-list, the ssl-min-ver and ssl-max-ver
options were not taken into account in entries other than the first one
because the corresponding fields in the ssl_bind_conf structure were not
copied in crtlist_dup_ssl_conf.
This should fix GitHub issue #2069.
This patch should be backported up to 2.4.
In some extremely unlikely case (or even impossible for now), we might
exit cli_parse_update_ocsp_response without raising an error but with a
filled 'err' buffer. It was not properly freed.
It does not need to be backported.
This patch removes dead code from the cli_parse_update_ocsp_response
function. The 'end' label is only used in case of error so the check of
the 'errcode' variable and the errcode variable itself become useless.
This patch does not need to be backported.
It fixes GitHub issue #2077.
During soft-stop, manage_proxy() (p->task) will try to purge
trashable (expired and not referenced) sticktable entries,
effectively releasing the process memory to leave some space
for new processes.
This is done by calling stktable_trash_oldest(), immediately
followed by a pool_gc() to give the memory back to the OS.
As already mentioned in dfe7925 ("BUG/MEDIUM: stick-table:
limit the time spent purging old entries"), calling
stktable_trash_oldest() with a huge batch can result in the function
spending too much time searching and purging entries, and ultimately
triggering the watchdog.
Lately, an internal issue was reported in which we could see
that the watchdog is being triggered in stktable_trash_oldest()
on soft-stop (thus initiated by manage_proxy()).
According to the report, the crash seems to only occur since 5938021
("BUG/MEDIUM: stick-table: do not leave entries in end of window during purge")
This could be the result of stktable_trash_oldest() now working
as expected, and thus spending a large amount of time purging
entries when called with a large enough <to_batch>.
Instead of adding new checks in stktable_trash_oldest(), here we
chose to address the issue directly in manage_proxy().
Since the stktable_trash_oldest() function is called with
<to_batch> == <p->table->current>, it's pretty obvious that it could
cause some issues during soft-stop if a large table, assuming it is
full prior to the soft-stop, suddenly sees most of its entries
becoming trashable because of the soft-stop.
Moreover, we should note that the call to stktable_trash_oldest() is
immediately followed by a call to pool_gc():
We know for sure that pool_gc(), as it involves malloc_trim() on
glibc, is rather expensive, and the more memory to reclaim,
the longer the call.
We need to ensure that the stktable_trash_oldest() call and the consequent
pool_gc() call both theoretically fit in a single task execution window
to avoid contention, and thus prevent the watchdog from being triggered.
To do this, we now allocate a "budget" for each purging attempt. The
budget is capped at 32K, which means that each sticktable cleanup
attempt will trash at most 32K entries.
The 32K value is quite arbitrary here, and might need to be adjusted or
even deduced from other parameters if this fails to properly address
the issue without introducing new side-effects.
The goal is to find a good balance between the max duration of each
cleanup batch and the frequency of (expensive) pool_gc() calls.
If most of the budget is actually spent trashing entries, then the task
will immediately be rescheduled to continue the purge.
This way, the purge is effectively batched over multiple task runs.
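For illustration, a minimal sketch of the batched purge (illustrative names such as
<budget> and <trashed>, not the actual manage_proxy() code):

    unsigned int budget  = MIN(p->table->current, 32768);
    unsigned int trashed = stktable_trash_oldest(p->table, budget);

    if (trashed == budget)
        task_wakeup(task, TASK_WOKEN_MSG); /* budget fully spent: continue the purge on the next run */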
This may be slowly backported to all stable versions.
[Please note that this commit depends on 6e1fe25 ("MINOR: proxy/pool:
prevent unnecessary calls to pool_gc()")]
This commit adds a new optional argument to smp_fetch_url_param
and smp_fetch_url_param_val that makes the parameter key comparison
case-insensitive.
Now users can retrieve URL parameters regardless of their case,
allowing parameters to be matched in case-insensitive applications.
Doc was updated.
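For illustration, a hedged configuration sketch (the parameter and backend names are
hypothetical, and the case-insensitive flag is assumed to be the new optional third
argument described in the updated documentation):

    acl has_session url_param(SessionID,,i) -m found
    use_backend app if has_session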
This commit adds a new argument to smp_fetch_url_param
that makes the parameter key comparison case-insensitive.
Several levels of callers were modified to pass this info.
In preparation for adding a third parameter to the url_param
sample-fetch function, we need to make the second parameter optional.
Users can now pass an empty 2nd argument to keep the default delimiter.
Since eb77824 ("MEDIUM: proxy: remove the deprecated "grace" keyword"),
stop_time is never set, so the related code in manage_proxy() is not
relevant anymore.
Removing code that refers to p->stop_time, since it was probably
overlooked.
Under certain soft-stopping conditions (ie: sticktable attached to proxy
and in-progress connections to the proxy that prevent haproxy from
exiting), manage_proxy() (p->task) will wake up every second to perform
a cleanup attempt on the proxy sticktable (to purge unused entries).
However, as reported by TimWolla in GH #2091, it was found that a
systematic call to pool_gc() could cause some CPU waste, mainly
because malloc_trim() (which is rather expensive) is being called
for each pool_gc() invocation.
As a result, such soft-stopping process could be spending a significant
amount of time in the malloc_trim->madvise() syscall for nothing.
Example "strace -c -f -p `pidof haproxy`" output (taken from
Tim's report):
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 46.77    1.840549        3941       467         1 epoll_wait
 43.82    1.724708          13    128509           sched_yield
  8.82    0.346968          11     29696           madvise
  0.58    0.023011          24       951           clock_gettime
  0.01    0.000257          10        25         7 recvfrom
  0.00    0.000033          11         3           sendto
  0.00    0.000021          21         1           rt_sigreturn
  0.00    0.000021          21         1           timer_settime
------ ----------- ----------- --------- --------- ----------------
100.00    3.935568          24    159653         8 total
To prevent this, we now only call pool_gc() when some memory is
really expected to be reclaimed as a direct result of the previous
stick table cleanup.
This is pretty straightforward since stktable_trash_oldest() returns
the number of trashed sticky sessions.
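For illustration, a minimal sketch of this condition (illustrative, not the exact
manage_proxy() code; <budget> is a hypothetical name):

    if (stktable_trash_oldest(p->table, budget) > 0)
        pool_gc(NULL); /* only worth it when entries were actually released */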
This may be backported to every stable version.
This bug arrived with this commit:
MINOR: quic: Send PING frames when probing Initial packet number space
This may happen when haproxy needs to probe the peer with very short packets
(only one PING frame). In this case, the packet must be padded. There was clearly
a case which was removed by the commit mentioned above. That said, there was
an extra byte which was added to the PADDING frame before the mentioned commit.
This is no longer the case with this patch.
Thank you to @tatsuhiro-t (ngtcp2 manager) for having reported this issue which
was revealed by the keyupdate test (on client side).
Must be backported to 2.7 and 2.6.
When a backend H2 connection is waiting for the connection to be fully
established, nothing is sent. However, it remains useful to detect a connection
error at this stage. It is especially important to release the H2 connection on
a connect error. Being able to set the H2_CF_ERR_PENDING or H2_CF_ERROR flags
when the underlying connection is not fully established prevents the H2C from
being inserted into an idle list in h2_detach().
Without this fix, an H2C in PREFACE state and relying on a connection in
error can be inserted in the safe list. Of course, it will be purged if not
reused. But in the meantime, it can be reused. When this happens, the
connection remains in error and nothing happens. In the end, a connection
error is returned to the client. On low traffic, we can imagine a scenario
where this dead connection is the only idle connection. If it is always
reused before being purged, no connection to the server is possible.
In addition, h2c_is_dead() is updated to declare as dead any H2 connection
with a pending error if its state is PREFACE or SETTINGS1 (thus if no
SETTINGS frame was received yet).
This patch should fix the issue #2092. It must be backported as far as 2.6.
In commit c2c043ed4 ("BUG/MEDIUM: stats: Consume the request except when
parsing the POST payload"), a change about applet was pushed too early. The
applet must still call cf_shutr() when the response is fully sent. It is
planned to rely on SE_FL_EOS flag, just like connections. But it is not
possible for now.
However, at first glance, this bug has no visible effect.
It is 2.8-specific. No backport needed.
Previously performing a config check of `.github/h2spec.config` would report a
20 byte leak as reported in GitHub Issue #2082.
The leak was introduced in a6c0a59e9a, which is
dev only. No backport needed.
This patch follows this commit which was not sufficient:
BUG/MINOR: quic: Missing STREAM frame data pointer updates
Indeed, after updating the ->offset field, the bit which informs the
frame builder of its presence must be systematically set.
This bug was revealed by the following BUG_ON() from
quic_build_stream_frame() :
bug condition "!!(frm->type & 0x04) != !!stream->offset.key" matched at src/quic_frame.c:515
This should fix the last crash that occurred in GitHub issue #2074.
Must be backported to 2.6 and 2.7.
Since 465a6c8 ("BUG/MEDIUM: applet: only set appctx->sedesc on
successful allocation"), sedesc is attached to the appctx after the
task is successfully allocated.
If the task fails to be allocated, the current sedesc cleanup is performed
on appctx->sedesc, which still points to NULL, so the sedesc won't be
freed.
This is fine when sedesc is provided as argument (!=NULL), but leads
to memory leaks if sedesc is allocated locally.
It was shown in GH #2086 that if sedesc != NULL when passed as
argument, it shouldn't be freed on error paths. This is what 465a6c8
was trying to address.
In an attempt to fix both issues at once, we do as Christopher
suggested: that is moving sedesc allocation attempt at the
end of the function, so that we don't have to free it in case
of error, thus removing the ambiguity.
(We won't risk freeing a sedesc that does not belong to us)
If we fail to allocate sedesc, then the task that was previously
created locally is simply destroyed.
This needs to be backported to 2.6 with 465a6c8 ("BUG/MEDIUM: applet:
only set appctx->sedesc on successful allocation")
[Copy pasting the original backport note from Willy:
In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.]
This old bug was revealed because of the commit 407210a34 ("BUG/MEDIUM:
stconn: Don't rearm the read expiration date if EOI was reached"). But it is
still possible to hit it if there is no server timeout. At first glance, the
2.8 is not affected. But the fix remains valid.
When a shutdown for writes is performed, the H1 connection must be notified
so it can be released. If it is subscribed for any I/O events, it is not an
issue. But, if there is no subscription, no I/O event is reported to the H1
connection and it remains alive. If the message was already fully received,
nothing more happens.
On my side, I was able to trigger the bug by freezing the session. Some
users reported a spinning loop on process_stream(). Not sure how to trigger
the loop. To freeze the session, the client timeout must be reached while
the server response was already fully received. On old version (< 2.6), it
only happens if there is no server timeout.
To fix the issue, we must wake up the H1 connection on shutdown for writes
if there is no I/O subscription.
This patch must be backported as far as 2.0. It should fix the issue #2090
and #2088.
The stats applet is designed to consume the request at the end, when it
finishes sending the response. And during the response forwarding, because
the request is not consumed, the applet states it will not consume
data. This avoids waking the applet up in a loop. When it finishes sending the
response, the request is consumed.
For POST requests, there is no issue because the response is small
enough. It is sent at once and must be processed by the HTTP analyzers. Thus
the forwarding is not performed by the applet itself. The applet is always
able to consume the request, regardless of the payload length.
But for other requests, it may be an issue. If the response is too big to be
sent at once and if the request is not fully received when the response
headers are sent, the applet may be blocked infinitely, not consuming the
request. Indeed, in this case the applet will be switched to infinite forwarding
mode and the request will not be consumed immediately. At the end, the request
buffer is flushed. But if some data must still be received, the applet is not
woken up because it is still in a "not-consuming" mode.
So, to fix the issue, we must take care to re-enable data consuming when the
end of the response is reached.
This patch must be backported as far as 2.6.
In the syslog applet, when a message was not fully received, we must request
more data by calling applet_need_more_data() and not by setting the
CF_READ_DONTWAIT flag on the request channel. Indeed, this flag is only used
to try a single read at once.
This patch could be backported as far as 2.4. On 2.5 and 2.4,
applet_need_more_data() must be replaced by si_cant_get().
Replace all BUG_ON() on frame allocation failure by a CONNECTION_CLOSE
sending with INTERNAL_ERROR code. This can happen for the following
cases :
* sending of MAX_STREAM_DATA
* sending of MAX_DATA
* sending of MAX_STREAMS_BIDI
In other cases (STREAM, STOP_SENDING, RESET_STREAM), an allocation
failure will only result in the current operation being interrupted and
retried later. However, it may be desirable in the future to replace
this with a simpler CONNECTION_CLOSE emission to recover better under a
memory pressure issue.
This should be backported up to 2.7.
Emit a CONNECTION_CLOSE with INTERNAL_ERROR code each time qcs
allocation fails. This can happen in two cases :
* when creating a local stream through application layer
* when instantiating a remote stream through qcc_get_qcs()
In both cases, error paths are already in place to interrupt the current
operation and a CONNECTION_CLOSE will be emitted soon after.
This should be backported up to 2.7.
Add BUG_ON() statements to ensure qcc_emit_cc()/qcc_emit_cc_app() is not
called more than once for each connection. This should improve code
resilience of MUX-QUIC and H3 and it will ensure that a scheduled
CONNECTION_CLOSE is not overwritten by another one with a different
error code.
This commit relies on the previous one to ensure all QUIC operations are
not conducted as soon as a CONNECTION_CLOSE has been prepared :
commit d7fbf458f8a4c5b09cbf0da0208fbad70caaca33
MINOR: mux-quic: interrupt most operations if CONNECTION_CLOSE scheduled
This should be backported up to 2.7.
Ensure that external MUX operations are interrupted if a
CONNECTION_CLOSE is scheduled. This was already the case for some
functions. This is extended to the qcc_recv*() family for
MAX_STREAM_DATA, RESET_STREAM and STOP_SENDING.
Also, qcc_release_remote_stream() is skipped in qcs_destroy() if a
CONNECTION_CLOSE is already scheduled.
All of this will ensure we only proceed to minimal treatment as soon as
a CONNECTION_CLOSE is prepared. Indeed, all sending and receiving is
stopped as soon as a CONNECTION_CLOSE is emitted so only internal
cleanup code should be necessary at this stage.
This should prevent a registered CONNECTION_CLOSE error status from being
overwritten by an error in a follow-up treatment.
This should be backported up to 2.7.
HTTP/3 graceful shutdown operation is used to emit a GOAWAY followed by
a CONNECTION_CLOSE with H3_NO_ERROR status. It is used for every
connection on release which means that if a CONNECTION_CLOSE was already
registered for a previous error, its status code is overwritten.
To fix this, skip shutdown operation if a CONNECTION_CLOSE is already
registered at the MUX level. This ensures that the correct error status
is reported to the peer.
This should be backported up to 2.6. Note that qc_shutdown() does not
exist on 2.6 so the modification will have to be made directly in
qc_release() as follows:
diff --git a/src/mux_quic.c b/src/mux_quic.c
index 49df0dc418..3463222956 100644
--- a/src/mux_quic.c
+++ b/src/mux_quic.c
@@ -1766,19 +1766,21 @@ static void qc_release(struct qcc *qcc)
TRACE_ENTER(QMUX_EV_QCC_END, conn);
- if (qcc->app_ops && qcc->app_ops->shutdown) {
- /* Application protocol with dedicated connection closing
- * procedure.
- */
- qcc->app_ops->shutdown(qcc->ctx);
+ if (!(qcc->flags & QC_CF_CC_EMIT)) {
+ if (qcc->app_ops && qcc->app_ops->shutdown) {
+ /* Application protocol with dedicated connection closing
+ * procedure.
+ */
+ qcc->app_ops->shutdown(qcc->ctx);
- /* useful if application protocol should emit some closing
- * frames. For example HTTP/3 GOAWAY frame.
- */
- qc_send(qcc);
- }
- else {
- qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+ /* useful if application protocol should emit some closing
+ * frames. For example HTTP/3 GOAWAY frame.
+ */
+ qc_send(qcc);
+ }
+ else {
+ qcc_emit_cc_app(qcc, QC_ERR_NO_ERROR, 0);
+ }
}
if (qcc->task) {
A H3 unidirectional stream is always opened with its stream type first
encoded as a QUIC variable integer. If the STREAM frame contains some
data but not enough to decode this varint, haproxy would crash due to an
ABORT_NOW() statement.
To fix this, ensure we support an incomplete stream type. In this case,
h3_init_uni_stream() returns 0 and the buffer content is not cleared.
Stream decoding will resume when new data are received for this stream
which should be enough to decode the stream type varint.
This bug has never occurred in production because standard H3 stream types
are small enough to be encoded on a single byte.
This should be backported up to 2.6.
Instead of reporting the inaccurate "malloc_trim() support" on -vv, let's
report the case where the memory allocator was actively replaced from the
one used at build time, as this is the corner case we want to be cautious
about. We also put a tainted bit when this happens so that it's possible
to detect it at run time (e.g. the user might have inherited it from an
environment variable during a reload operation).
The now unused is_trim_enabled() function was finally dropped.
The runtime detection of the default memory allocator was broken by
Commit d8a97d8f6 ("BUG/MINOR: illegal use of the malloc_trim() function
if jemalloc is used") due to a misunderstanding of its role. The purpose
is not to detect whether we're on non-jemalloc but whether or not the
allocator was changed from the one we booted with, in which case we must
be extra cautious and absolutely refrain from calling malloc_trim() and
its friends.
This was done only to drop the message saying that malloc_trim() is
supported, which will be totally removed in another commit, and could
possibly be removed even in older versions if this patch would get
backported since in the end it provides limited value.
There's a recurring issue regarding shared library loading from Lua. If
the imported library is linked with a different version of openssl but
doesn't use it, the check will trigger and emit a warning. In practice
it's not necessarily a problem as long as the API is the same, because
all symbols are replaced and the library will use the included ssl lib.
It's only a problem if the library comes with a different API because
the dynamic linker will only replace known symbols with ours, and not
all. Thus the loaded lib may call (via a static inline or a macro) a
few different symbols that will allocate or preinitialize structures,
and which will then pass them to the common symbols coming from a
different and incompatible lib, exactly what happens to users of Lua's
luaossl when building haproxy with quictls and without rebuilding
luaossl.
In order to better address this situation, we now define groups of
symbols that must always appear/disappear in a consistent way. It's OK
if they're all absent from either haproxy or the lib, it means that one
of them doesn't use them so there's no problem. But if any of them is
defined on any side, all of them must be in the exact same state on the
two sides. The symbols are represented using a bit in a mask, and the
mask of the group of other symbols they're related to. This allows
checking 64 symbols, which should be OK for a while. The first ones that
are tested for now are symbols whose combination differs between
openssl versions 1.0, 1.1, and 3.0 as well as quictls. Thus a difference
there will indicate upcoming trouble, but no error will mean that we're
running on a seemingly compatible API and that all symbols should be
replaced at once.
The same mechanism could possibly be used for pcre/pcre2, zlib and the
few other optional libs that may occasionally cause runtime issues when
used by dependencies, provided that discriminatory symbols are found to
distinguish them. But in practice such issues are pretty rare, mainly
because loading standard libs via Lua on a custom build of haproxy is
not very common.
In the event that further symbol compatibility issues would be reported
in the future, backporting this patch as well as the following series
might be an acceptable solution given that the scope of changes is very
narrow (the malloc stuff is needed so that the malloc/free tests can be
dropped):
BUG/MINOR: illegal use of the malloc_trim() function if jemalloc is used
MINOR: pools: make sure 'no-memory-trimming' is always used
MINOR: pools: intercept malloc_trim() instead of trying to plug holes
MEDIUM: pools: move the compat code from trim_all_pools() to malloc_trim()
MINOR: pools: export trim_all_pools()
MINOR: pattern: use trim_all_pools() instead of a conditional malloc_trim()
MINOR: tools: relax dlopen() on malloc/free checks
Now that we can provide a safe malloc_trim() we no longer need to detect
that some dependencies use a different set of malloc/free
functions than ours, because they will use the same as those we're
seeing, and we control their use of malloc_trim(). The comment about
the incompatibility with DEBUG_MEM_STATS is not true anymore either
since the feature relies on macros so we're now OK.
This will stop catching libraries linked against glibc's allocator
when haproxy is natively built with jemalloc. This was especially
annoying since dlopen() on a lib depending on jemalloc() tends to
fail on TLS issues.
We already have some generic code in trim_all_pools() to implement the
equivalent of malloc_trim() on jemalloc and macos. Instead of keeping the
logic there, let's just move it to our own malloc_trim() implementation
so that we can unify the mechanism and the logic. Now any low-level code
calling malloc_trim() will either be disabled by haproxy's config if the
user decides to, or will be mapped to the equivalent mechanism if malloc()
was intercepted by a preloaded jemalloc.
Trim_all_pools() preserves the benefit of serializing threads (which we
must not impose to other libs which could come with their own threads).
It means that our own code should mostly use trim_all_pools() instead of
calling malloc_trim() directly.
As reported by Miroslav in commit d8a97d8f6 ("BUG/MINOR: illegal use of
the malloc_trim() function if jemalloc is used") there are still occasional
cases where it's discovered that malloc_trim() is being used without its
suitability being checked first. This is a problem when using another
incompatible allocator. But there's a class of use cases we'll never be
able to cover, it's dynamic libraries loaded from Lua. In order to address
this more reliably, we now define our own malloc_trim() that calls the
previous one after checking that the feature is supported and that the
allocator is the expected one. This way child libraries that would call
it will also be safe.
The function is intentionally left defined all the time so that it will
be possible to clean up some code that uses it by removing ifdefs.
The global option 'no-memory-trimming' was added in 2.6 with commit
c4e56dc58 ("MINOR: pools: add a new global option "no-memory-trimming"")
but there were some cases left where it was not considered. Let's make
is_trim_enabled() also consider it.
Complete traces with information from qcc and qcs instances about
flow-control level. This should help to debug further issues on sending.
This must be backported up to 2.7.
Change the trace from developer to data level whenever the flow control
limitation is updated following a MAX_DATA or MAX_STREAM_DATA reception.
This should be backported up to 2.7.
Add traces for _qc_send_qcs() function. Most notably, traces have been
added each time a qc_stream_desc buffer allocation fails and when stream
or connection flow-level is reached. This should improve debugging for
emission issues.
This must be backported up to 2.7.
Connection flow-control level calculation is a bit complicated. To
ensure it is never exceeded, each time a transfer occurs from a
qcs.tx.buf to its qc_stream_desc buffer it is accounted in
qcc.tx.offsets at the connection level. This value is not decremented
even if the corresponding STREAM frame is rejected by the quic-conn
layer as its emission will be retried later.
In normal cases this works as expected. However there is an issue if a
qcs instance is removed with prepared data left. In this case, its data
is still accounted in qcc.tx.offsets despite being removed which may
block other streams. This happens every time a qcs is reset with
remaining data which will be discarded in favor of a RESET_STREAM frame.
To fix this, if a stream has prepared data in qcc_reset_stream(), it is
decremented from qcc.tx.offsets. A BUG_ON() has been added to ensure
qcs_destroy() is never called for a stream with prepared data left.
This bug can cause two issues :
* transfer freeze as data unsent from closed streams still count on the
connection flow-control limit and will block other streams. Note that
this issue was not reproduced so it's unsure if this really happens
without the following issue first.
* a crash on a BUG_ON() statement in the qc_send() loop over
qc_send_frames(). Streams may remain in the send list with nothing
to send due to the connection flow-control limit. However, the limit is never
reached through qcc_streams_sent_done() so the QC_CF_BLK_MFCTL flag is not
set, which allows the loop to continue.
The last case was reproduced after several minutes of testing using the
following command :
$ ngtcp2-client --exit-on-all-streams-close -t 0.1 -r 0.1 \
--max-data=100K -n32 \
127.0.0.1 20443 "https://127.0.0.1:20443/?s=1g" 2>/dev/null
This should fix github issues #2049 and #2074.
In the event that HAProxy is linked with the jemalloc library, it is still
shown that malloc_trim() is enabled when executing "haproxy -vv":
..
Support for malloc_trim() is enabled.
..
The problem is not so much the message itself as the fact that malloc_trim()
is called in the pat_ref_purge_range() function without any check.
This was solved by setting the using_default_allocator variable to the
correct value in the detect_allocator() function, and before calling
malloc_trim() it is now checked whether the function should be called.
Starting haproxy with -dL helps enumerate the list of libraries in use.
But sometimes in order to go further we'd like to see their address
ranges. This is already supported on the CLI's "show libs" but not on
the command line where it can sometimes help troubleshoot startup issues.
Let's dump them when in verbose mode. This way it doesn't change the
existing behavior for those trying to enumerate libs to produce an archive.
When threads are disabled, the compiler complains that we might be
accessing tg->abs[] out of bounds since the array is of size 1. It
cannot know that the condition to do this is never met, and given
that it's not in a fast path, we can make it more obvious.
qc_notify_send() is used to wake up the MUX layer for sending. This
function first ensures that all sending conditions are met to avoid
waking up the MUX unnecessarily.
One of these conditions is to check if there is room in the congestion
window. However, when probe packets must be sent due to a PTO
expiration, RFC 9002 explicitly mentions that the congestion window
must be ignored, which was not the case prior to this patch.
This commit fixes this by first setting <pto_probe> of the 01RTT packet
number space before invoking qc_notify_send(). This ensures that the
congestion window won't be checked anymore to wake up the MUX layer until
probing packets are sent.
This commit replaces the following one which was not sufficient :
commit e25fce03eb
BUG/MINOR: quic: Dysfunctional 01RTT packet number space probing
This should be backported up to 2.7.
On PTO probe timeout expiration, a probe packet must be emitted.
quic_pto_pktns() is used to determine for which packet space the timer
has expired. However, if MUX is already subscribed for sending, it is
woken up without checking first if this happened for the 01RTT packet
space.
It is unsure whether this is really a bug as in most cases, the MUX is
established only after the Initial and Handshake packet spaces are removed.
However, the situation is not so clear when 0-RTT is used. For this
reason, adjust the code to explicitly check for the 01RTT packet space
before waking up the MUX layer.
This should be backported up to 2.6. Note that qc_notify_send() does not
exist in 2.6 so it should be replaced by the explicit block checking
(qc->subs && qc->subs->events & SUB_RETRY_SEND).
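For illustration, a hedged sketch of that explicit 2.6 check (illustrative, not the
exact code): wake the MUX up for sending only if it subscribed for send events.

    if (qc->subs && qc->subs->events & SUB_RETRY_SEND) {
        qc->subs->events &= ~SUB_RETRY_SEND;
        tasklet_wakeup(qc->subs->tasklet);
        if (!qc->subs->events)
            qc->subs = NULL;
    }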
If appctx_new_on() fails to allocate a task, it will not remove the
freshly allocated sedesc from the appctx despite freeing it, causing
a UAF. Let's only assign appctx->sedesc upon success.
This needs to be backported to 2.6. In 2.6 the function is slightly
different and called appctx_new(), though the issue is exactly the
same.
In h1c_frt_stream_new() and h1c_bck_stream_new(), if we fail to completely
initialize the freshly allocated h1s, typically because sc_attach_mux()
fails, we must use h1s_destroy() to de-initialize it. Otherwise it stays
attached to the h1c when released, causing use-after-free upon the next
wakeup. This can be triggered upon memory shortage.
This needs to be backported to 2.6.
Using -dMfail alone does nothing unless tune.fail-alloc is set, which
renders it pretty useless as-is, and is not intuitive. Let's change
this so that the failure rate is preset to 1% when the option is set on
the command line. This allows injecting failures without having to edit
the configuration.
If we fail to allocate a new stream in sc_new_from_endp(), and the call
to sc_new() allocated the sedesc itself (which normally doesn't happen),
then it doesn't get released on the failure path. Let's explicitly
handle this case so that it's not overlooked and avoids some head
scratching sessions.
This may be backported to 2.6.
There's an occasional crash that can be triggered in sc_detach_endp()
when calling conn->mux->detach() upon memory allocation error. The
problem in fact comes from sc_attach_mux(), which doesn't reset the
sc type flags upon tasklet allocation failure, leading to an attempt
at detaching an incompletely initialized stconn. Let's just attach
the sc after the tasklet allocation succeeds, not before.
This must be backported to 2.6.
On the allocation error path in h2_init() we may check if
h2c->wait_event.tasklet needs to be released but it has not yet been
zeroed. Let's do this before jumping to the freeing location.
This needs to be backported to all maintained versions.
In h2s_close() we may dereference h2s->sd to get the sc, but this
function may be called on allocation error paths, so we must check
for this specific condition. Let's also update the comment to make
it explicitly permitted.
This needs to be backported to 2.6.
In stream_free() if we fail to allocate s->scb() we go to the path where
we try to free it, and it doesn't like being called with a null at all.
It's easily reproducible with -dMfail,no-cache and "tune.fail-alloc 10"
in the global section.
This must be backported to 2.6.
This bug arrived with this commit:
"MINOR: quic: implement qc_notify_send()".
The ->tx.pto_probe variable was no longer set when qc_process_timer(), the timer
task for the connection responsible for detecting packet loss and probing upon
PTO expiration, was run, leading to interrupted stream transfers. This was revealed
by blackhole interop failed tests where one could see that qc_process_timer()
was woken up without traces such as the following in the log file:
"needs to probe 01RTT packet number space"
Must be backported to 2.7 and to 2.6 if the commit mentioned above
is backported to 2.6 in the meantime.
The packet ranges of an ACK frame were handled from the largest to the smallest
packet number, leading to a big number of ebtree insertions when the packets are
handled in the inverse order of the one they were sent in. This was detected a long
time ago but left in the code to stress our implementation. It is time to be more
efficient and process the packets so as to avoid useless ebtree insertions.
Modify qc_ackrng_pkts(), responsible for handling the acknowledged packets from an
ACK frame range of acknowledged packets.
Must be backported to 2.7.
Despite having replaced the SSL BIOs to use our own raw_sock layer, we
still didn't exploit the CO_SFL_MSG_MORE flag which is pretty useful to
avoid sending incomplete packets. It's particularly important for SSL
since the extra overhead almost guarantees that each send() will be
followed by an incomplete (and often odd-sized) segment.
We already have an xprt_st set of flags to pass info to the various
layers, so let's just add a new one, SSL_SOCK_SEND_MORE, that is set
or cleared during ssl_sock_from_buf() to transfer the knowledge of
CO_SFL_MSG_MORE. This way we can recover this information and pass
it to raw_sock.
This alone is sufficient to increase by ~5-10% the H2 bandwidth over
SSL when multiple streams are used in parallel.
Traces show that sendto() rarely has MSG_MORE on H2 despite sending
multiple buffers. The reason is that the loop iterating over the buffer
ring doesn't have this info and doesn't pass it down.
But now we know how many buffers are left to be sent, so we know whether
or not the current buffer is the last one. As such we can set this flag
for all buffers but the last one.
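For illustration, a hedged sketch of the idea (illustrative names, not the exact
h2_send() code): request MSG_MORE for every buffer except the last one left to send.

    flags = (buffers_left > 1) ? CO_SFL_MSG_MORE : 0;
    sent  = conn->xprt->snd_buf(conn, conn->xprt_ctx, buf, b_data(buf), flags);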
Before muxes were used, we used to refrain from reading past the
buffer's reserve. But with muxes which have their own buffer, this
rule was a bit forgotten, resulting in an extraneous read to be
performed just because the rx buffer cannot be entirely transferred
to the stream layer:
sendto(12, "GET /?s=16k HTTP/1.1\r\nhost: 127."..., 84, MSG_DONTWAIT|MSG_NOSIGNAL, NULL, 0) = 84
recvfrom(12, "HTTP/1.1 200\r\nContent-length: 16"..., 16320, 0, NULL, NULL) = 16320
recvfrom(12, ".123456789.12345", 16, 0, NULL, NULL) = 16
recvfrom(12, "6789.123456789.12345678\n.1234567"..., 15244, 0, NULL, NULL) = 182
recvfrom(12, 0x1e5d5d6, 15062, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
Here the server sends 16kB of payload after a headers block, the mux reads
16320 into the ibuf, and the stream layer consumes 15360 from the first
h1_rcv_buf(), which leaves 960 into the buffer and releases a few indexes.
The buffer cannot be realigned due to these remaining data, and a subsequent
read is made on 16 bytes, then again on 182 bytes.
By avoiding reading too much on the first call, we can avoid needlessly
filling this buffer:
recvfrom(12, "HTTP/1.1 200\r\nContent-length: 16"..., 15360, 0, NULL, NULL) = 15360
recvfrom(12, "456789.123456789.123456789.12345"..., 16220, 0, NULL, NULL) = 1158
recvfrom(12, 0x1d52a3a, 15062, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
This is much more efficient and uses less RAM since the first buffer that
was emptied can now be released.
Note that a further improvement (tested) consists in reading even less
(typically 1kB) so that most of the data are transferred in zero-copy, and
are not read until process_stream() is scheduled. This patch doesn't do that
for now so that it can be backported without any obscure impact.
CertiK Skyfall Team reported that passing an index greater than
QPACK_SHT_SIZE in a qpack instruction referencing a literal field
name with name reference or an indexed field line will cause an
out-of-bounds read that may crash the process, and confirmed that
this fix addresses the issue.
This needs to be backported as far as 2.5.
sc-add-gpc() was implemented in 5a72d03 ("MINOR: stick-table:
implement the sc-add-gpc() action")
This new action was exposed everywhere sc-inc-gpc() is available,
except for http-after-response.
But there doesn't seem to be a technical constraint that prevents us from
exposing it in http-after-response.
It was probably overlooked, let's add it.
No backport needed, unless 5a72d03 ("MINOR: stick-table: implement the
sc-add-gpc() action") is being backported.
This patch follows this one which was not sufficient:
"BUG/MINOR: quic: Missing STREAM frame length updates"
Indeed, it is not sufficient to update the ->len and ->offset member
of a STREAM frame to move it forward. The data pointer must also be updated.
This is not done by the STREAM frame builder.
Must be backported to 2.6 and 2.7.
Emeric noticed that h2 bit-rate performance was always slightly lower
than h1 when the CPU is saturated. Strace showed that we were always
sending data in 2kB chunks, corresponding to the max_record size. What's
happening is that when this mechanism of dynamic record size was
introduced, the STREAMER flag at the stream level was relied upon.
Since all this was moved to the muxes, the flag has to be passed as
an argument to the snd_buf() function, but the h2 mux did not use it
despite a comment mentioning it, probably because before the multi-buf
it was not easy to figure out the status of the buffer.
The solution here consists in checking if the mbuf is congested or not,
by checking if it has more than one buffer allocated. If so we set the
CO_SFL_STREAMER flag, otherwise we don't. This way moderate size
exchanges continue to be made over small chunks, but downloads will
be able to use the large ones.
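For illustration, a hedged sketch of the check described above (illustrative, not the
exact mux code): the mbuf ring is considered congested, and thus worth large TLS
records, as soon as more than one buffer is allocated.

    if (!br_single(h2c->mbuf))
        flags |= CO_SFL_STREAMER;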
While it could be backported to all supported versions, it would be
better to limit it to the last LTS, so let's do it for 2.7 and 2.6 only.
This patch requires previous commit "MINOR: buffer: add br_single() to
check if a buffer ring has more than one buf".
During performance tests, Emeric faced a case where the wakeups of
sc_conn_io_cb() caused by h2_resume_each_sending_h2s() were multiplied
by 5-50 and a lot of CPU was being spent doing this for apparently no
reason.
The culprit is h2_send() not behaving well with congested buffers and
small SSL records. What happens when the output is congested is that
all buffers are full, and data are emitted in 2kB chunks, which are
sufficient to wake all streams up again to ask them to send data again,
something that will obviously only work for one of them at best, and
waste a lot of CPU in wakeups and memcpy() due to the small buffers.
When this happens, the performance can be divided by 2-2.5 on large
objects.
Here the chosen solution against this is to keep in mind that as long
as there are still at least two buffers in the ring after calling
xprt->snd_buf(), it means that the output is congested and there's
no point trying again, because these data will just be placed into
such buffers and will wait there. Instead we only mark the buffer
decongested once we're back to a single allocated buffer in the ring.
By doing so we preserve the ability to deal with large concurrent
bursts while not causing a thundering herd by waking all streams for
almost nothing.
This needs to be backported to 2.7 and 2.6. Other versions could
benefit from it as well but it's not strictly necessary, and we can
reconsider this option if some excess calls to sc_conn_io_cb() are
faced.
Note that this fix depends on this recent commit:
MINOR: buffer: add br_single() to check if a buffer ring has more than one buf
When detaching a stream, if it's the last one and the mbuf is blocked,
we leave without freeing the stream yet. We also refresh the h2c task's
timeout, except that it's possible that there's no such task in case
there is no client timeout, causing a crash. The fix just consists in
doing this when the task exists.
This bug has always been there and is extremely hard to meet even
without a client timeout. This fix has to be backported to all
branches, but it's unlikely anyone has ever met it anyway.
The commit 5e1b0e7bf ("BUG/MEDIUM: connection: Clear flags when a conn is
removed from an idle list") introduced a regression. The CO_FL_SAFE_LIST and
CO_FL_IDLE_LIST flags are used when the connection is released to properly
decrement used/idle connection counters. If a connection is idle, these
flags must be preserved till the connection is really released. It may be
removed from the list but not immediately released. If these flags are lost
when it is finally released, the current number of used connections is
erroneously decremented. It means this counter may become negative and the
counter tracking the number of idle connections is not decremented,
suggesting a leak.
So, the above commit is reverted and instead we improve a bit the way to
detect an idle connection. The function conn_get_idle_flag() must now be
used to know if a connection is in an idle list. It returns the connection
flag corresponding to the idle list if the connection is idle
(CO_FL_SAFE_LIST or CO_FL_IDLE_LIST) or 0 otherwise. But if the connection
is scheduled to be removed, 0 is also returned, regardless of the connection
flags.
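For illustration, a hedged sketch of this behaviour (not the exact conn_get_idle_flag()
source; conn_is_scheduled_for_removal() is a hypothetical helper standing for the
"scheduled to be removed" check):

    static inline int conn_get_idle_flag(const struct connection *conn)
    {
        if (conn_is_scheduled_for_removal(conn)) /* hypothetical helper */
            return 0;
        return conn->flags & (CO_FL_SAFE_LIST | CO_FL_IDLE_LIST);
    }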
This new function is used when the connection is temporarily removed from
the list to be used, mainly in muxes.
This patch should fix #2078 and #2057. It must be backported as far as 2.2.
Some STREAM frame lengths were not updated before being duplicated, built
or requeued, contrary to their ack offsets. This leads haproxy to crash when
receiving acknowledgements for such frames, with this thread #1 backtrace:
Thread 1 (Thread 0x7211b6ffd640 (LWP 986141)):
#0 ha_crash_now () at include/haproxy/bug.h:52
No locals.
#1 b_del (b=<optimized out>, del=<optimized out>) at include/haproxy/buf.h:436
No locals.
#2 qc_stream_desc_ack (stream=stream@entry=0x7211b6fd9bc8, offset=offset@entry=53176, len=len@entry=1122) at src/quic_stream.c:111
Thank you to @Tristan971 for having provided such traces which reveal this issue:
[04|quic|5|c_conn.c:1865] qc_requeue_nacked_pkt_tx_frms(): entering : qc@0x72119c22cfe0
[04|quic|5|_frame.c:1179] qc_frm_unref(): entering : qc@0x72119c22cfe0
[04|quic|5|_frame.c:1186] qc_frm_unref(): remove frame reference : qc@0x72119c22cfe0 frm@0x72118863d260 STREAM_F uni=0 fin=1 id=460 off=52957 len=1122 3244
[04|quic|5|_frame.c:1194] qc_frm_unref(): leaving : qc@0x72119c22cfe0
[04|quic|5|c_conn.c:1902] qc_requeue_nacked_pkt_tx_frms(): updated partially acked frame : qc@0x72119c22cfe0 frm@0x72119c472290 STREAM_F uni=0 fin=1 id=460 off=53176 len=1122
Note that haproxy has a much higher chance to crash if this frame is the last one
(fin bit set). But another condition must be fulfilled to update the ack offset:
a previous STREAM frame from the same stream with the same offset but with less
data must be acknowledged by the peer. This is the condition to update the ack offset.
For other frames without the fin bit in the same conditions, I guess the stream may be
truncated because too much data is removed from the stream when it is
acknowledged.
Must be backported to 2.6 and 2.7.
There is a bug in the smp_fetch_dport() function which affects the 'f' case,
also known as 'fc_dst_port' sample fetch.
conn_get_src() is used to retrieve the address prior to calling conn_dst().
But this is wrong: conn_get_dst() should be used instead.
Because of that, conn_dst() may return unexpected results since the dst
address is not guaranteed to be set depending on the conn state at the time
the sample fetch is used.
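For illustration, a hedged sketch of the fix (illustrative, not the exact
smp_fetch_dport() code): make sure the destination address is retrieved before
reading it.

    if (!conn || !conn_get_dst(conn))   /* was mistakenly conn_get_src() */
        return 0;
    smp->data.u.sint = get_host_port(conn_dst(conn));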
This was reported by Corin Langosch on the ML:
during his tests he noticed that using fc_dst_port in a log-format string
resulted in the correct value being printed in the logs but when he used it
in an ACL, the ACL did not evaluate properly.
This can be easily reproduced with the following test conf:
|frontend test-http
| bind 127.0.0.1:8080
| mode http
|
| acl test fc_dst_port eq 8080
| http-request return status 200 if test
| http-request return status 500 if !test
A request on 127.0.0.1:8080 should normally return 200 OK, but here it
will return a 500.
The same bug was also found in smp_fetch_dst_is_local() (fc_dst_is_local
sample fetch) by reading the code: the fix was applied twice.
This needs to be backported up to 2.5
[both sample fetches were introduced in 2.5 with 888cd70 ("MINOR:
tcp-sample: Add samples to get original info about client connection")]
When a connection error is encountered during a receive, the error is not
immediately reported to the SE descriptor. We take care to process all
pending input data first. However, when the error is finally reported, a
fatal error is reported only if a read0 was also received. Otherwise, only a
pending error is reported.
With a raw socket, it is not an issue because shutdowns for read and write
are systematically reported too when a connection error is detected. So in
this case, the fatal error is always reported to the SE descriptor. But with
an SSL socket, in case of a pure SSL error, only the connection error is
reported. This prevents the fatal error from going up. And because the
connection is in error, no more receives or sends are performed on the socket.
The mux is blocked till a timeout is triggered at the stream level, looping
infinitely trying to progress.
To fix the bug, during the demux stage, when there is no longer any pending
data, the read error is reported to the SE descriptor, regardless of whether
the shutdown for reads was received or not. No more data are expected, so it
is safe.
This patch should fix the issue #2046. It must be backported to 2.7.
Like for the connection error code, when FDs are dumped, the ssl error code is now
displayed. This may help to diagnose why a connection error occurred.
This patch may be backported to help debugging.
When FDs are dumped, the connection error code is now displayed. This may help
to diagnose why a connection error occurred.
This patch may be backported to help debugging.
When HAProxy is stopping, the DNS resolutions must be stopped, except those
triggered from a "do-resolve" action. To do so, the resolutions themselves
cannot be destroyed, the current design is too complex. However, it is
possible to mute the resolvers tasks. The same is already performed with the
health-checks. On soft-stop, the tasks are still running periodically but
nothing is performed.
For the resolvers, when the process is stopping, before running a
resolution, we check all the requesters attached to this resolution. If at
least one requester is a stream or if there is a requester attached to a running
proxy, a new resolution is triggered. Otherwise, we ignore the
resolution. It will be evaluated again on the next wakeup. This way,
"do-resolve" actions still work during soft-stop but other resolutions
are stopped.
Of course, it may be seen as a feature and not a bug because it was never
performed. But it is in fact not expected at all to still be performing
resolutions when HAProxy is stopping. In addition, a proxy option will be
added to change this behavior.
This patch partially fixes the issue #1874. It could be backported to 2.7
and maybe to 2.6. But no further.
On soft-stop, we must properly stop backends and not only proxies with at
least a listener. This is mandatory in order to stop the health checks. A
previous fix was provided to do so (ba29687bc1 "BUG/MEDIUM: proxy: properly
stop backends"). However, only the stop_proxy() function was fixed. When
HAProxy is stopped, this function is no longer used. So the same kind of fix
must be done in do_soft_stop_now().
This patch partially fixes the issue #1874. It must be backported as far as
2.4.
The ocsp-related CLI commands tend to work with OCSP_CERTIDs as well as
certificate paths so the path should also be added to the output of the
"show ssl ocsp-response" command when no certid or path is provided.
In order to increase usability, the "show ssl ocsp-response" also takes
a frontend certificate path as parameter. In such a case, it behaves the
same way as "show ssl cert foo.pem.ocsp".
If the last update before a deinit happens was successful, the pointer
to the httpclient in the ocsp update context was not reset while the
httpclient instance was already destroyed.
Instead of having a dedicated httpclient instance and its own code
decorrelated from the actual auto update one, the "update ssl
ocsp-response" will now use the update task in order to perform updates.
Since the cli command allows to update responses that were never
included in the auto update tree, a new flag was added to the
certificate_ocsp structure so that the said entry can be inserted into
the tree "by hand" and it won't be reinserted back into the tree after
the update process is performed. The 'update_once' flag "stole" a bit
from the 'fail_count' counter since it is the one less likely to reach
UINT_MAX among the ocsp counters of the certificate_ocsp structure.
This new logic required that every certificate_ocsp entry contained all
the ocsp-related information at all time since entries that are not
supposed to be configured automatically can still be updated through the
cli. The logic of the ssl_sock_load_ocsp was changed accordingly.
The dedicated proxy used for OCSP auto update is renamed OCSP-UPDATE
which should be more explicit than the previous HC_OCSP name. The
reference to the underlying httpclient is simply kept in the
documentation.
The certid is removed from the log line since it is not really
comprehensible and is replaced by the path to the corresponding frontend
certificate.
It is more a less a revert of the commit b65af26e1 ("MEDIUM: mux-pt: Don't
always set a final error on SE on the sending path"). The PT multiplexer is
so simple that an error on the sending path is terminal. Unlike other muxes,
there is no connection level here. However, instead of reporting an final
error by setting SE_FL_ERROR, we set SE_FL_EOS flag instead if a read0 was
received on the underlying connection. Concretely, it is always true with
the current design of the raw socket layer. But it is cleaner this way.
Without this patch, it is possible to block a TCP socket if a connection
error is triggered when data are sent (for instance a broken pipe) while the
upper stream does not expect to receive more data.
Note the patch above introduced a regression because error handling at the
connection level is quite simple: all errors are final. But we must keep in
mind this may change. And if so, this will require moving back to a 2-step
error handling in the mux-pt.
This patch must be backported to 2.7.
This one is printed as the iocb in the "show fd" output, and arguably
this wasn't very convenient as-is:
293 : st=0x000123(cl heopI W:sRa R:sRA) ref=0 gid=1 tmask=0x8 umask=0x0 prmsk=0x8 pwmsk=0x0 owner=0x7f488487afe0 iocb=0x50a2c0(main+0x60f90)
Let's unstatify it and export it so that the symbol can now be resolved
from the various points that need it.
This bug was revealed by h2load tests run as follows:
h2load -t 4 --npn-list h3 -c 64 -m 16 -n 16384 -v https://127.0.0.1:4443/
This opens (-c) 64 QUIC connections and sends (-n) 16384 h3 requests from
(-t) 4 threads, i.e. 256 requests per connection. Such tests did not always
pass and often ended with results like these displayed by h2load:
finished in 53.74s, 38.11 req/s, 493.78KB/s
requests: 16384 total, 2944 started, 2048 done, 2048 succeeded, 14336
failed, 14336 errored, 0 timeout
status codes: 2048 2xx, 0 3xx, 0 4xx, 0 5xx
traffic: 25.92MB (27174537) total, 102.00KB (104448) headers (space
savings 1.92%), 25.80MB (27053569) data
UDP datagram: 3883 sent, 24330 received
min max mean sd ± sd
time for request: 48.75ms 502.86ms 134.12ms 75.80ms 92.68%
time for connect: 20.94ms 331.24ms 189.59ms 84.81ms 59.38%
time to 1st byte: 394.36ms 417.01ms 406.72ms 9.14ms 75.00%
req/s : 0.00 115.45 14.30 38.13 87.50%
The number of successful requests was always a multiple of 256.
Activating the traces also showed that some connections were blocked after having
successfully completed their handshakes because the mux was never started. The mux
is started upon the acceptance of the connection.
Under heavy load, some connections were never accepted. From the moment where
more than 4 (MAXACCEPT) connections were enqueued before a listener could be
woken up to accept at most 4 connections, the remaining connections were not
accepted, not even at a later listener tasklet wakeup.
Add a call to tasklet_wakeup() on the accept-list tasklet of the listeners to
wake it up if there are remaining connections to accept after having called
listener_accept(). In this case the listener must not be removed from this
accept list, otherwise it will not accept anything more at the next call.
Must be backported to 2.7 and 2.6.
In environments where SYSTEM_MAXCONN is defined when compiling, the
master will use this value instead of the original minimal value which
was set to 100. When this happens, the master process could allocate
RAM excessively since it does not need a high maxconn (for
example if SYSTEM_MAXCONN was set to 100000 or more).
This patch fixes the issue by using the new define MASTER_MAXCONN which
defines a default maxconn of 100 for the master process.
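A minimal sketch of what such a define looks like, the value being the one
mentioned above (the exact location and surrounding code are not shown here):

    #ifndef MASTER_MAXCONN
    #define MASTER_MAXCONN 100  /* default maxconn for the master process */
    #endif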
Must be backported as far as 2.5.
The goal is to send signals to random threads at random instants so that
they spin for a random delay in a relax() loop, trying to give back the
CPU to another competing hardware thread, in hope that from time to time
this can trigger in critical areas and increase the chances to provoke a
latent concurrency bug. For now none were observed.
For example, this command starts 64 such tasks waking after random delays
of 0-1ms and delivering signals to trigger such loops on 3 random threads:
for i in {1..64}; do
socat - /tmp/sock1 <<< "expert-mode on;debug dev delay-inj 2 3"
done
This command is only enabled when DEBUG_DEV is set at build time.
As mentioned in commit 237e6a0d6 ("BUG/MAJOR: fd/thread: fix race between
updates and closing FD"), a race was found during stress tests involving
heavy backend connection reuse with many competing closes.
Here the problem is complex. The analysis in commit f69fea64e ("MAJOR:
fd: get rid of the DWCAS when setting the running_mask") that removed
the DWCAS in 2.5 overlooked a few races.
First, a takeover from thread1 could happen just after fd_update_events()
in thread2 validates it holds the tmask bit in the CAS loop. Since thread1
releases running_mask after the operation, thread2 will succeed the CAS
and both will believe the FD is theirs. This does explain the occasional
crashes seen with h1_io_cb() being called on a bad context, or
sock_conn_iocb() seeing conn->subs vanish after checking it. This issue
can be addressed using a DWCAS in both fd_takeover() and fd_update_events()
as it was before the patch above but this is not portable to all archs and
is not easy to adapt for those lacking it, due to some operations still
happening only on individual masks after the thread groups were added.
Second, the checks after fd_clr_running() for the current thread being
the last one is not sufficient: at the exact moment the operation
completes, another thread may also set and drop the running bit and see
itself as alone, and both can call _fd_delete_orphan() in parallel. In
order to prevent this from happening, we cannot rely on the absence of
others, we need an explicit flag indicating that the FD must be closed.
One approach that was attempted consisted in playing with the thread_mask
but that was not reliable since it could still match between the late
deletion and the early insertion that follows. Instead, a new FD flag
was added, FD_MUST_CLOSE, that exactly indicates that the call to
_fd_delete_orphan() must be done. It is set by fd_delete(), and
atomically cleared by the first one which checks it, and which is the
only one to call _fd_delete_orphan().
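A hedged sketch of this set-then-test-and-clear idea using plain C11 atomics
(haproxy uses its own atomic wrappers and field layout, so the names below
are illustrative):

    static _Atomic unsigned int fd_state;  /* illustrative per-FD state word */

    /* in fd_delete(): request the final close */
    atomic_fetch_or(&fd_state, FD_MUST_CLOSE);

    /* in the closing path: only the thread that actually clears the flag
     * performs the final close, so it cannot be done twice in parallel
     */
    if (atomic_fetch_and(&fd_state, ~FD_MUST_CLOSE) & FD_MUST_CLOSE)
        _fd_delete_orphan(fd);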
With both points addressed, there's no more visible race left:
- takeover() only happens under the connection list's lock and cannot
compete with fd_delete() since fd_delete() must first remove the
connection from the list before deleting the FD. That's also why it
doesn't need to call _fd_delete_orphan() when dropping its running
bit.
- takeover() sets its running bit then atomically replaces the thread
mask, so that until that's done, it doesn't validate the condition
to end the synchronization loop in fd_update_events(). Once it's OK,
the previous thread's bit is lost, and this is checked for in
fd_update_events()
- fd_update_events() can compete with fd_delete() at various places
which are explained above. Since fd_delete() clears the thread mask
as after setting its running bit and after setting the FD_MUST_CLOSE
bit, the synchronization loop guarantees that the thread mask is seen
before going further, and that once it's seen, the FD_MUST_CLOSE flag
is already present.
- fd_delete() may start while fd_update_events() has already started,
but fd_delete() must hold a bit in thread_mask before starting, and
that is checked by the first test in fd_update_events() before setting
the running_mask.
- the poller's _update_fd() will not compete against _fd_delete_orphan()
nor fd_insert() thanks to the fd_grab_tgid() that's always done before
updating the polled_mask, and guarantees that we never pretend that a
polled_mask has a bit before the FD is added.
The issue is very hard to reproduce and is extremely time-sensitive.
Some tests were required with a 1-ms timeout with request rates
closely matching 1 kHz per server, though certain tests sometimes
benefitted from saturation. It was found that adding the following
slowdown at a few key places helped a lot and managed to trigger the
bug in 0.5 to 5 seconds instead of tens of minutes on a 20-thread
setup:
{ volatile int i = 10000; while (i--); }
Particularly, placing it at key places where only one of running_mask
or thread_mask is set and not the other one yet (e.g. after the
synchronization loop in fd_update_events or after dropping the
running bit) did yield great results.
Many thanks to Olivier Houchard for this expert help analysing these
races and reviewing candidate fixes.
The patch must be backported to 2.5. Note that 2.6 does not have tgid
in FDs, and that it requires a change of output on fd_clr_running() as
we need the previous bit. This is provided by carefully backporting
commit d6e1987612 ("MINOR: fd: make fd_clr_running() return the previous
value instead"). Tests have shown that the lack of tgid is a showstopper
for 2.6 and that unless a better workaround is found, it could still be
preferable to backport the minimum pieces required for fd_grab_tgid()
to 2.6 so that it stays stable long.
In case too many thread groups are needed for the threads, we emit
an error indicating the problem. Unfortunately the threads and groups
counts were reversed.
This can be backported to 2.6.
The NUMA detection code tries not to interfere with any taskset the user
could have specified in init scripts. For this it compares the number of
CPUs available with the number the process is bound to. However, the CPU
count is retrieved after being applied an upper bound of MAX_THREADS, so
if the machine has more than 64 CPUs, the comparison always fails and
makes haproxy think the user has already enforced a binding, and it does
not pin it anymore to a single NUMA node.
This can be verified by issuing:
$ socat /path/to/sock - <<< "show info" | grep thread
On a dual 48-CPU machine it reports 64, implying that threads are allowed
to run on the second socket:
Nbthread: 64
With this fix, the function properly reports 96, and the output shows 48,
indicating that a single NUMA node was used:
Nbthread: 48
Of course nothing is changed when "no numa-cpu-mapping" is specified:
Nbthread: 64
This can be backported to 2.4.
This issue was revealed by "Multiple streams" QUIC tracker test which very often
fails (locally) with a file of about 1Mbytes (x4 streams). The log of QUIC tracker
revealed that from its point of view, the 4 files were never all received entirely:
"results" : {
"stream_0_rec_closed" : true,
"stream_0_rec_offset" : 1024250,
"stream_0_snd_closed" : true,
"stream_0_snd_offset" : 15,
"stream_12_rec_closed" : false,
"stream_12_rec_offset" : 72689,
"stream_12_snd_closed" : true,
"stream_12_snd_offset" : 15,
"stream_4_rec_closed" : true,
"stream_4_rec_offset" : 1024250,
"stream_4_snd_closed" : true,
"stream_4_snd_offset" : 15,
"stream_8_rec_closed" : true,
"stream_8_rec_offset" : 1024250,
"stream_8_snd_closed" : true,
"stream_8_snd_offset" : 15
},
But this is in contradiction with other QUIC tracker logs which confirm that haproxy
has really (re)sent the stream at the suspected offset (stream_12_rec_offset):
1152085,
"transport",
"packet_received",
{
"frames" : [
{
"frame_type" : "stream",
"length" : "155",
"offset" : "72689",
"stream_id" : "12"
}
],
"header" : {
"dcid" : "a14479169ebb9dba",
"dcil" : "8",
"packet_number" : "466",
"packet_size" : 190
},
"packet_type" : "1RTT"
}
When detected as lost, the packets are enlisted, then their frames are
requeued in their packet number space by qc_requeue_nacked_pkt_tx_frms().
This was done using a local list which was spliced onto the packet number
space frame list. This had the bad effect of retransmitting the frames in
the reverse order from the one they were sent in. This is something the
QUIC tracker Go client does not like at all!
Removing the frame splicing fixes this issue and allows haproxy to pass the
"Multiple streams" test.
Must be backported to 2.7.
Normally the task_wakeup() in sock_conn_io_cb() is expected to
happen on the same thread the FD is attached to. But due to the
way the code was arranged in the past (with synchronous callbacks)
we continue to update connections after the wakeup, which always
makes the reader have to think deeply whether it's possible or not
to call another thread there. Let's just move the tasklet_wakeup()
at the end to make sure there's no problem with that.
This bug arrived with this commit:
b5a8020e9 MINOR: quic: RETIRE_CONNECTION_ID frame handling (RX)
and was revealed by h3 interop tests with clients like s2n-quic and quic-go
as noticed by Amaury.
Indeed, one must check that the CID matching the sequence number provided by a received
RETIRE_CONNECTION_ID frame does not match the DCID of the packet.
Remove useless ->curr_cid_seq_num member from quic_conn struct.
The sequence number lookup must be done in qc_handle_retire_connection_id_frm()
to check the validity of the RETIRE_CONNECTION_ID frame. It returns the CID to be
retired into the <cid_to_retire> variable passed as parameter to this function if
the frame is valid and if the CID was not already retired.
Must be backported to 2.7.
Since the following commit :
commit fb375574f9
MINOR: quic: mark quic-conn as jobs on socket allocation
quic-conn instances are marked as jobs. This prevents the haproxy process
from stopping while a transfer is in progress. To not delay process
termination, idle connections are woken up through their MUX instances
so they can be released immediately.
However, there is no mechanism to wake up quic connections left in
closing or draining state. This means that haproxy process termination
is delayed until every closing quic connection's timer has expired.
To improve this, a new function quic_handle_stopping() is called when
haproxy process is stopping. It simply wakes up the idle timer task of
all connections in the global closing list. These connections will thus
be released immediately to not interrupt haproxy process stopping.
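A hedged sketch of that wake-up loop (the list and member names are
assumptions based on the description above, not necessarily the actual ones):

    struct quic_conn *qc;

    /* wake the idle timer task of every closing/draining connection so it
     * can release itself immediately instead of waiting for its timer
     */
    list_for_each_entry(qc, &th_ctx->quic_conns_clo, el_th_ctx)
        task_wakeup(qc->idle_timer_task, TASK_WOKEN_OTHER);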
This should be backported up to 2.7.
A new global quic-conn list has been added by the previous patch. It will
contain every quic-conn in closing or draining state.
Thus, it is now easier to include or skip them on a "show quic" output :
when the default list on the current thread has been browsed entirely,
either we skip to the next thread or we look at the closing list on the
current thread.
This should be backported up to 2.7.
When a CONNECTION_CLOSE is emitted or received, a QUIC connection enters
respectively in draining or closing state. These states are a loose
equivalent of TCP TIME_WAIT. No data can be exchanged anymore but the
connection is maintained during a certain timer to handle packet
reordering or loss.
A new global list has been defined for QUIC connections in
closing/draining state inside thread_ctx structure. Each time a
connection enters one of these states, it will be moved from the
default global list to the new closing list.
The objective of this patch is to quickly filter connections on
closing/draining. Most notably, this will be used to wake up these
connections and avoid that haproxy process stopping is delayed by them.
A dedicated function qc_detach_th_ctx_list() has been implemented to
transfer a quic-conn from one list instance to the other. This takes
care of back-references attached to a quic-conn instance in case of a
running "show quic".
This should be backported up to 2.7.
This patch adds the support for the PS algorithms when verifying JWT
signatures (rsa-pss). It was not managed during the first implementation
and previously raised an "Unmanaged algorithm" error.
The tests use the same rsa signature as the plain rsa tests (RS256 ...)
and the implementation simply adds a call to
EVP_PKEY_CTX_set_rsa_padding in the function that manages rsa and ecdsa
signatures.
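As a hedged illustration of that extra call (OpenSSL EVP API; the pkey, sig
and payload variables and the is_pss flag are assumptions, and haproxy's
actual code is organized differently):

    #include <openssl/evp.h>
    #include <openssl/rsa.h>

    EVP_MD_CTX *md_ctx = EVP_MD_CTX_new();
    EVP_PKEY_CTX *pkey_ctx = NULL;
    int ok = 0;

    if (md_ctx &&
        EVP_DigestVerifyInit(md_ctx, &pkey_ctx, EVP_sha256(), NULL, pkey) == 1) {
        if (is_pss) { /* PS256/PS384/PS512 */
            EVP_PKEY_CTX_set_rsa_padding(pkey_ctx, RSA_PKCS1_PSS_PADDING);
            /* recover the salt length from the signature on verification */
            EVP_PKEY_CTX_set_rsa_pss_saltlen(pkey_ctx, RSA_PSS_SALTLEN_AUTO);
        }
        ok = (EVP_DigestVerify(md_ctx, sig, sig_len, payload, payload_len) == 1);
    }
    EVP_MD_CTX_free(md_ctx);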
The signatures in the reg-test were built thanks to the PyJWT python
library once again.
With 737d10f ("BUG/MEDIUM: dns: ensure ring offset is properly reajusted
to head") relative offset calculation was fixed in dns_session_io_handler()
and dns_process_req() functions.
But if we compare with the changes performed in the patch that introduced
the bug: d9c7188 ("MEDIUM: ring: make the offset relative to the head/tail
instead of absolute"), we can see that dns_resolve_send() is missing from
the patch.
Applying both 737d10f + ("BUG/MINOR: dns: fix ring offset calculation on
first read") to dns_resolve_send() function.
With this last commit, we should be back at pre d9c7188 behavior.
No backport needed.
With 737d10f ("BUG/MEDIUM: dns: ensure ring offset is properly reajusted
to head") ring offset is now properly re-adjusted in dns_session_io_handler()
and dns_process_req().
But the previous patch does not cope well if the first read is performed
on a non-empty ring since relative ofs will be computed from ds->ofs=0 or
dss->ofs_req=0.
In this case, the relative offset could become invalid since we would mix up
relative offsets with absolute offsets.
To fix this, we apply the same logic performed in d9c7188 ("MEDIUM: ring:
make the offset relative to the head/tail instead of absolute") for the
cli_io_handler_show_ring() function: that is using b_peek_ofs(buf, 0) to
set the contextual offset instead of hard-coding it to 0.
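A hedged sketch of the change (the offset variable is illustrative):

    /* attach the reader to the current head of the ring instead of
     * hard-coding the starting offset to 0
     */
    ofs = b_peek_ofs(buf, 0);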
This should be considered as a minor bugfix since this bug was discovered by
reading the code: 737d10f already survived a good amount of stress-tests as
shown in GH #2068.
No backport needed as 737d10f is not marked for backports.
Since d9c7188 ("MEDIUM: ring: make the offset relative to the head/tail instead
of absolute"), ring offset calculation has changed: we don't rely on ring->ofs
absolute offset anymore.
But with the above patch, relative offset is not properly calculated in
sink_forward_oc_io_handler() and sink_forward_io_handler().
The issue here is the same as 737d10f ("BUG/MEDIUM: dns: ensure ring offset is
properly reajusted to head") since dns and sink_forward share the same
ring logic:
When the ring is becoming full, ring_write() will try to regain some space to
insert new data by calling b_del() on older messages. Here b_del() moves the
buffer's head under the hood, and since ring->ofs cannot be used to "correct"
the relative offset, both sink_forward_oc_io_handler() and
sink_forward_io_handler() start to use an invalid offset.
At this point, we will suffer from ring data corruption resulting in unexpected
behavior or process crashes.
This can be easily demonstrated with the following test:
|log-forward syslog
| dgram-bind 127.0.0.1:5114
| log ring@logbuffer local0
|
|ring logbuffer
| format rfc5424
| size 16384
| server logserver 127.0.0.1:5114
Haproxy will forward incoming logs on udp@127.0.0.1:5114 to
tcp@127.0.0.1:5114
Then use the following tcp server:
nc -l -p 5114
With the following udp log sender:
|while [ 1 ]
|do
| logger --udp --server 127.0.0.1 -P 5114 -p user.warn "Test 7"
|done
Once the ring buffer is full (it takes less than a second to fill the 16k
buffer) haproxy starts to misbehave and the log forwarding stops.
We apply the same fix as in 737d10f ("BUG/MEDIUM: dns: ensure ring offset is
properly reajusted to head").
Please note the ~0 case that is handled slightly differently in this patch:
this is required to properly start reading from a non-empty ring. This case
will be fixed in dns related code in the following patch.
This does not need to be backported as d9c7188 was not marked for backports.
Modify quic_transport_params_dump() and other functions related to the
transport parameters value dump from TRACE() to make their output more
compact.
Add call to quic_transport_params_dump() to dump the transport parameters
from "show quic" CLI command.
Must be backported to 2.7.
Add QUIC_FL_RX_PACKET_SPIN_BIT new RX packet flag to mark an RX packet as having
the spin bit set. Idem for the connection with QUIC_FL_CONN_SPIN_BIT flag.
Implement qc_handle_spin_bit() to set/unset QUIC_FL_CONN_SPIN_BIT for the connection
as soon as a packet number could be deciphered.
Modify quic_build_packet_short_header() to set the spin bit when building
a short packet header.
Validated by quic-tracker spin bit test.
Must be backported to 2.7.
Add a new ->curr_cid_seq_num member to the quic_conn struct to store the
connection ID sequence number currently used by the connection.
Implement qc_handle_retire_connection_id_frm() to handle this RX frame.
Implement qc_retire_connection_seq_num() to remove a connection ID from its
sequence number.
Implement qc_build_new_connection_id_frm() to allocate a new NEW_CONNECTION_ID
frame from a CID.
Modify qc_parse_pkt_frms() which parses the frames of an RX packet to handle
the case of the RETIRE_CONNECTION_ID frame.
Must be backported to 2.7.
Add a new ->next_cid_seq_num member to the quic_conn struct to store the next
sequence number to be used to allocate a connection ID.
It is initialized to 0 from qc_new_conn() which initializes a connection.
Modify new_quic_cid() to use this variable each time it is called without
giving the possibility to the caller to pass the sequence number for the
connection to be allocated.
Modify quic_build_post_handshake_frames() to use ->next_cid_seq_num
when building NEW_CONNECTION_ID frames after the handshake has been completed.
Limit the number of connection IDs provided to the peer to the minimum
between 4 and the value it sent with active_connection_id_limit transport
parameter. This includes the connection ID used by the connection to send
these new connection IDs.
Must be backported to 2.7.
A peer must not send active_connection_id_limit values smaller than 2
which is also the minimum value when not sent.
Make the transport parameters decoding fail in this case.
Must be backported to 2.7.
STREAM frame retransmission has been recently fixed. A new boolean field
<dup> was created for quic_stream frame type. It is set for duplicated
STREAM frame to ensure extra checks on the underlying buffer are
conducted before sending the frame. All of this has been implemented by
this commit :
315a4f6ae5
BUG/MEDIUM: quic: do not crash when handling STREAM on released MUX
However, the above commit is incomplete. In the MUX code, when a new
STREAM frame is created, <dup> is left uninitialized. In most cases this
is harmless as it will only add extra unneeded checks before sending the
frame. So this is mainly a performance issue.
There is however one case where this bug will lead to a crash : when the
response consists only of an empty STREAM frame. In this case, the empty
frame will be silently removed as it is incorrectly assimilated to an
already acked frame range in qc_build_frms(). This can trigger a
BUG_ON() on the MUX code as a qcs instance is still in the send list
after qc_send_frames() invocation.
Note that it is extremely rare to have only an empty STREAM frame. It
was reproduced with HTTP/0.9 where no HTTP status line exists on an
empty body. I do not know if this is possible on HTTP/3 as a status line
should be present each time in a HEADERS frame.
Properly initialize the <dup> field to 0 on each STREAM frame generated by
the QUIC MUX to fix this issue.
This crash may be linked to github issue #2049.
This should be backported up to 2.6.
Since the below patch, ring offset calculation for readers has changed.
commit d9c7188633
MEDIUM: ring: make the offset relative to the head/tail instead of absolute
For readers, this requires adjusting their offsets to be relative to the
ring head each time read is resumed. Indeed, the buffer head can change any
time a ring_write() is performed after older entries were purged.
This operation was not performed on the DNS code which causes the offset
to become invalid. In most cases, the following BUG_ON() was triggered :
FATAL: bug condition "msg_len + ofs + cnt + 1 > b_data(buf)" matched
at src/dns.c:522
Fix this by adjusting DNS reader offsets when entering
dns_session_io_handler() and dns_process_req().
This bug was reproduced by using a backend with 10 servers using SRV
record resolution on a single resolvers section. A BUG_ON() crash would
occur after less than 5 minutes of process execution.
This does not need to be backported as the above patch is not.
This should fix github issue #2068.
While running some L7 retries tests, Christopher and I stumbled upon a
very strange behavior showing some occasional server timeouts when the
server closes keep-alive connections quickly. The issue can be
reproduced with the following config:
global
expose-experimental-directives
#tune.fd.edge-triggered on # can speed up the issue
defaults
mode http
timeout client 5s
timeout server 10s
timeout connect 2s
listen f
bind :8001
http-reuse always
retry-on all-retryable-errors
server next 127.0.0.1:8002
frontend b
bind :8002
timeout http-keep-alive 1 # one ms
redirect location /
Sending fast requests without reusing the client connection on port 8001
with a single connection and at least 3 threads on haproxy occasionally
shows some glitches and pauses (below with timeout server 2s):
$ taskset -c 2,3 h1load -e -t 1 -r 1 -c 1 http://127.0.0.1:8001/
# time conns tot_conn tot_req tot_bytes err cps rps bps ttfb
1 1 9794 9793 959714 0 9k79 9k79 7M67 42.94u
2 1 9794 9793 959714 0 0.00 0.00 0.00 -
3 1 9794 9793 959714 0 0.00 0.00 0.00 -
4 0 16015 16015 1569470 0 6k22 6k22 4M87 522.9u
5 0 18657 18656 1828190 2 2k63 2k63 2M06 39.22u
If this doesn't happen, limiting to a request rate close to 1/timeout
may help.
What is happening is that after several migrations, a late report
via fd_update_events() may detect that the thread is not welcome, and
will want to program an update so that the current thread's poller
disables its polling on it. It is allowed to do so because it used
fd_grab_tgid(). But what if _fd_delete_orphan() was just starting to
be called and already reset the update_mask ? We'll end up with a bit
present in the update mask, then _fd_delete_orphan() resets the tgid,
which will prevent the poller from consuming that update. The update
is not needed anymore since the FD was closed, but in this case nobody
will clear this bit until the same FD is reused again and cleared. And
as long as the thread's bit remains in the update_mask, no new updates
will be programmed for the next use of this FD on the same thread since
due to the bit being present, fd_nbupdt will not be changed. This is
what is causing this timeout.
The fix consists in making sure _fd_delete_orphan() waits for the
occasional watchers to leave, and to do this before clearing the
update_mask. This will be either fd_update_events() trying to check
its thread_mask, or the poller checking its updates, so that's pretty
short. But it definitely closes this race.
This fix is needed since the introduction of fd_grab_tgid(), hence 2.7.
Note that while testing the fix, another related issue concerning the
atomicity of running_mask vs thread_mask popped up and will have to be
fixed till 2.5 as part of another patch. It may make the tests for this
fix occasionally trigger a few BUG_ON() or face a null conn->subs in
sock_conn_iocb(), though these ones are much more difficult to trigger.
This is not caused by this fix.
The MUX instance is released before its quic-conn counterpart. On
termination, an H3 GOAWAY is emitted to prevent the client from opening new
streams for this connection.
The quic-conn instance will stay alive until all opened streams data are
acknowledged. If the client tries to open a new stream during this
interval despite the GOAWAY, quic-conn is responsible to request its
immediate closure with a STOP_SENDING + RESET_STREAM.
This behavior was already implemented but the received packet with the
new STREAM was never acknowledged. This was fixed with the following
commit :
commit 156a89aef8
BUG/MINOR: quic: acknowledge STREAM frame even if MUX is released
However, this patch introduces a regression as it did not skip the call
to qc_handle_strm_frm() despite the MUX instance being released. This
can cause a segfault when using qcc_get_qcs() on a released MUX
instance. To fix this, add a missing break statement which will skip
qc_handle_strm_frm() when the MUX instance is not initialized.
This crash was reproduced using a short client timeout and by sending
several requests with a delay between them using a modified aioquic. It
produces the following backtrace :
#0 0x000055555594d261 in __eb64_lookup (x=4, root=0x7ffff4091f60) at include/import/eb64tree.h:132
#1 eb64_lookup (root=0x7ffff4091f60, x=4) at src/eb64tree.c:37
#2 0x000055555563fc66 in qcc_get_qcs (qcc=0x7ffff4091dc0, id=4, receive_only=1, send_only=0, out=0x7ffff780ca70) at src/mux_quic.c:668
#3 0x0000555555641e1a in qcc_recv (qcc=0x7ffff4091dc0, id=4, len=40, offset=0, fin=1 '\001', data=0x7ffff40c4fef "\001&") at src/mux_quic.c:974
#4 0x0000555555619d28 in qc_handle_strm_frm (pkt=0x7ffff4088e60, strm_frm=0x7ffff780cf50, qc=0x7ffff7cef000, fin=1 '\001') at src/quic_conn.c:2515
#5 0x000055555561d677 in qc_parse_pkt_frms (qc=0x7ffff7cef000, pkt=0x7ffff4088e60, qel=0x7ffff7cef6c0) at src/quic_conn.c:3050
#6 0x00005555556230aa in qc_treat_rx_pkts (qc=0x7ffff7cef000, cur_el=0x7ffff7cef6c0, next_el=0x0) at src/quic_conn.c:4214
#7 0x0000555555625fee in quic_conn_app_io_cb (t=0x7ffff40c1fa0, context=0x7ffff7cef000, state=32848) at src/quic_conn.c:4640
#8 0x00005555558a676d in run_tasks_from_lists (budgets=0x7ffff780d470) at src/task.c:596
#9 0x00005555558a725b in process_runnable_tasks () at src/task.c:876
#10 0x00005555558522ba in run_poll_loop () at src/haproxy.c:2945
#11 0x00005555558529ac in run_thread_poll_loop (data=0x555555d14440 <ha_thread_info+64>) at src/haproxy.c:3141
#12 0x00007ffff789ebb5 in ?? () from /usr/lib/libc.so.6
#13 0x00007ffff7920d90 in ?? () from /usr/lib/libc.so.6
This should fix github issue #2067.
This must be backported up to 2.6.
In very very rare cases, it is possible the Initial packet number space
must be probed even if there are no more in-flight CRYPTO frames.
In such cases, a PING frame is sent into an Initial packet. As this
packet is ack-eliciting, it must be padded by the server. qc_do_build_pkt()
is modified to do so.
Take the opportunity of this patch to modify the trace for TX frames to
easily distinguish them from other frame-related traces.
Must be backported to 2.7.
Mark the connection as limited by the anti-amplification limit when trying to
probe the peer.
Wake up the connection PTO/loss detection timer as soon as a datagram is
received. This was done only when the datagram was dropped.
This fixes deadlock issues revealed by some interop runner tests.
Must be backported to 2.7 and 2.6.
Some frames are marked as already acknowledged because they come from
duplicated packets whose original packet has been acknowledged. There is no need
to resend such packets or frames.
Implement qc_pkt_with_only_acked_frms() to detect packet with only
already acknowledged frames inside and use it from qc_prep_fast_retrans()
which selects the packet to be retransmitted.
Must be backported to 2.6 and 2.7.
Even if there is a check in callers of qc_prep_hdshk_fast_retrans() and
qc_prep_fast_retrans() to prevent retransmissions of packets with no ack-eliciting
frames, these two functions should pay attention not to do that, especially if
someone decides to modify their implementations in the future.
Must be backported to 2.6 and 2.7.
This is an old bug which arrived with this commit, due, I guess, to a
misinterpretation of the RFC where the desired effect was to acknowledge all
the handshake packets:
77ac6f566 BUG/MINOR: quic: Missing acknowledgments for trailing packets
This had the bad effect of acknowledging all the handshake packets, even the
ones which are not ack-eliciting.
Must be backported to 2.7 and 2.6.
Dump the secret used to derive the next one during a key update initiated by the
client, and dump the resulting new secret and the new key and IV to be used to
decrypt Application level packets.
Also add a trace when the key update is supposed to be initiated on haproxy side.
This has already helped in diagnosing an issue revealed by the key update interop
test with xquic as client.
Must be backported to 2.7.
v2 interop runner test revealed this bug as follows:
[01|quic|4|c_conn.c:4087] new packet : qc@0x7f62ec026e30 pkt@0x7f62ec056390 el=I pn=491940080 rel=H
[01|quic|5|c_conn.c:1509] qc_pkt_decrypt(): entering : qc@0x7f62ec026e30
[01|quic|0|c_conn.c:1553] quic_tls_decrypt() failed : qc@0x7f62ec026e30
[01|quic|5|c_conn.c:1575] qc_pkt_decrypt(): leaving : qc@0x7f62ec026e30
[01|quic|0|c_conn.c:4091] packet decryption failed -> dropped : qc@0x7f62ec026e30 pkt@0x7f62ec056390 el=I pn=491940080
Only the decryption of v2 Initial packets received from the clients was impacted.
There is no issue encrypting v2 Initial packets. This is due to the fact that,
when the version is negotiated, the client may send two versions of Initial
packets (currently v1, then v2). The selection was done for the TX path but not
on the RX path.
Implement qc_select_tls_ctx() to select the correct TLS cipher context for all
types of packets and call this function before removing the header protection
and before deciphering the packet.
Must be backported to 2.7.
When retransmitting datagrams with two coalesced packets inside, the second
packet was not taken into consideration when checking there is enough room to
send the datagram, especially when limited by the anti-amplification limit.
Must be backported to 2.6 and 2.7.
Before building a packet into a datagram, ensure there is sufficient space for at
least 1200 bytes. Also pad datagrams with only one ack-eliciting Initial packet
inside.
Must be backported to 2.7 and 2.6.
Making http_7239_valid_obfsc() inline because it is only called by inline
functions.
Removing a dead comment and documenting the proxy_http_parse_{7239,xff,xot} functions.
No backport needed.
Anonymization mode has two CLI handlers "set anon <on|off>" and "set
anon global-key". The last one only requires admin level. However, as
cli_find_kw() is implemented, only the first handler will be retrieved
as they both start with the same prefix "set anon".
This has the effect of executing the wrong handler for "set anon
global-key" with an error message about an invalid keyword. To fix this,
handler definitions have been separated for both "set anon on" and "set
anon off" commands. This allows minimal changes while keeping
the same "set anon" prefix for each command.
Also take this opportunity to fix a reference to a non-existing "set
global-key" CLI handler in the documentation.
This must be backported up to 2.7.
When a STREAM frame is re-emitted, it will point to the same stream
buffer as the original one. If an ACK is received for either one of
these frame, the underlying buffer may be freed. Thus, if the second
frame is declared as lost and schedule for retransmission, we must
ensure that the underlying buffer is still allocated or interrupt the
retransmission.
Stream buffer is stored as an eb_tree indexed by the stream ID. To avoid
to lookup over a tree each time a STREAM frame is re-emitted, a lost
STREAM frame is flagged as QUIC_FL_TX_FRAME_LOST.
In most cases, this code is functional. However, there are several
potential issues which may cause a segfault :
- when explicitly probing with a STREAM frame, the frame won't be
flagged as lost
- when splitting a STREAM frame during retransmission, the flag is not
copied
To fix both these cases, QUIC_FL_TX_FRAME_LOST flag has been converted
to a <dup> field in quic_stream structure. This field is now properly
copied when splitting a STREAM frame. Also, as this is now an inner
quic_frame field, it will be copied automatically on qc_frm_dup()
invocation thus ensuring that it will be set on probing.
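A hedged illustration of both fixes (the frame variable names are
illustrative):

    /* when splitting a STREAM frame for retransmission, propagate the flag
     * telling that the underlying buffer may already have been released
     */
    split_frm->stream.dup = orig_frm->stream.dup;

    /* and since <dup> now lives inside the frame itself, a plain frame
     * duplication for probing carries it along automatically
     */
    probe_frm = qc_frm_dup(orig_frm);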
This issue was encountered randomly with the following backtrace :
#0 __memmove_avx512_unaligned_erms ()
#1 0x000055f4d5a48c01 in memcpy (__len=18446698486215405173, __src=<optimized out>,
#2 quic_build_stream_frame (buf=0x7f6ac3fcb400, end=<optimized out>, frm=0x7f6a00556620,
#3 0x000055f4d5a4a147 in qc_build_frm (buf=buf@entry=0x7f6ac3fcb5d8,
#4 0x000055f4d5a23300 in qc_do_build_pkt (pos=<optimized out>, end=<optimized out>,
#5 0x000055f4d5a25976 in qc_build_pkt (pos=0x7f6ac3fcba10,
#6 0x000055f4d5a30c7e in qc_prep_app_pkts (frms=0x7f6a0032bc50, buf=0x7f6a0032bf30,
#7 qc_send_app_pkts (qc=0x7f6a0032b310, frms=0x7f6a0032bc50) at src/quic_conn.c:4184
#8 0x000055f4d5a35f42 in quic_conn_app_io_cb (t=0x7f6a0009c660, context=0x7f6a0032b310,
This should fix github issue #2051.
This should be backported up to 2.6.
In the OCSP response callback, instead of using the actual date of the
system, the scheduler's 'now' timer is used when checking a response's
validity.
This patch can be backported to all stable versions.
When adding a new certificate through the CLI and appending it to a
crt-list with the 'ocsp-update' option set, the new certificate would
not be added to the OCSP response update list.
The only thing that was missing was the copy of the ocsp_update mode
from the ssl_bind_conf into the ckch_store's object.
An extra wakeup of the update task also needed to happen in case the
newly inserted entry needs to be updated before the next wakeup of the
task.
This patch does not need to be backported.
The minimum and maximum delays between two automatic updates of a given
OCSP response can now be set via global options. It allows to limit the
update rate of OCSP responses for configurations that use many frontend
certificates with the ocsp-update option set if the updates are deemed
too costly.
A new format option can be passed to the "show ssl ocsp-response" CLI
command to dump the contents of an OCSP response in base64. This is
needed because thanks to the new OCSP auto update mechanism, we could
end up using an OCSP response internally that was never provided by the
user.
In case of successive OCSP update errors for a given OCSP response, the
retry delay will be multiplied by 2 for every new failure in order to
avoid retrying too often to update responses for which the responder is
unresponsive (for instance). The maximum delay will still be taken into
account so the OCSP update requests will still be sent at least every
hour.
Instead of using the same proxy as other http client calls (through lua
for instance), the OCSP update will use a dedicated proxy which will
enable it to change the log format and log conditions (for instance).
This proxy will have the NOLOGNORM option and regular logging will be
managed by the update task itself because in order to dump information
related to OCSP updates, we need to control the moment when the logs are
emitted (instead of relying on the stream's life which is decorrelated
from the update itself).
The update task then calls sess_log directly, which uses a dedicated
ocsp logformat that fetches specific OCSP data. Sess_log was preferred
to the more low level app_log because it offers the strength of
"regular" sample fetches and allows to add generic information alongside
OCSP ones in the log line.
In case of connection error (unreachable server for instance), a regular
httpclient log line will also be emitted. This line will have some extra
HTTP related info that can't be provided by the ocsp update logging
mechanism.
This patch adds a series of sample fetches that rely on the specified
OCSP update context structure. They will then be of use only in the
context of an ongoing OCSP update.
They cannot be used directly in the configuration so they won't be made
public. They will be used in the OCSP update's specific log format which
should be emitted by the update task itself in a future patch.
This command can be used to dump information about the entries contained
in the ocsp update tree. It will display one line per concerned OCSP
response and will contain the expected next update time as well as the
time of the last successful update, and the number of successful and
failed attempts.
In order to have some information about the frontend certificate when
dumping the contents of the ocsp update tree from the cli, we could
either keep a reference to a ckch_store in the certificate_ocsp
structure, which might cause some dangling reference problems, or
simply copy the path to the certificate in the ocsp response structure.
This latter solution was chosen because of its simplicity.
Those new specific error codes will make it possible to know a bit better what
went wrong during an OCSP update process. They will come in handy in
future sample fetches as well as in debugging means (via the cli or
future traces).
In case of allocation error during the construction of an OCSP request
for instance, we would have ended up reinserting the ocsp entry at the same
place in the ocsp update tree which could potentially lead to an
"endless" loop of errors in ssl_ocsp_update_responses. In such a case,
entries are now reinserted further in the tree (1 minute later) in order
to avoid such a chain of alloc failures.
When an abort is detected before all headers were received, and if there are
pending incoming data, we must report a parsing error instead of a
connection abort. This way it can be handled as an invalid message by the
HTTP analyzers instead of an early abort with no message.
It is especially important to be accurate on L7 retry. Indeed, without this
fix, this case will be handled by the "empty-response" retries policy while a
retry on "junk-response" is more accurate.
This patch must be backported to 2.7.
A recent fix (af124360e "BUG/MEDIUM: http-ana: Detect closed SC on opposite side
during body forwarding") was pushed to sync a side when the opposite
one is in closing state. However, sometimes, the synchro is performed too early,
preventing an L7 retry from being performed.
Indeed, while the above fix is valid on the response side, on the request side,
if the response was not yet received, we must wait before closing.
So, to fix the fix, on the request side, we now wait at least until the response
is received before finishing the request analysis. Of course, if there is an error,
an abort or anything wrong on the server side, the response analyser should
handle it.
This patch is related to #2061. No backport needed.
A regression about "empty-response" L7 retry was introduced with the commit
dd6496f591 ("CLEANUP: http-ana: Remove useless if statement about L7
retries").
The if statement was removed on a wrong assumption. Indeed, L7 retries on
status are now handled in the HTTP analysers. Thus, the stream-connector
(formerly the conn-stream, and before again the stream-interface) no longer
reports a read error to force a retry. But it is still possible to get a read
error with no response. In this case, we must perform a retry if
"empty-response" is enabled.
So the if statement is re-introduced, reverting the cleanup.
This patch should fix the issue #2061. It must be backported as far as 2.4.
When we are about to perform an L7 retry, we deal with the conn_retries
counter, to be sure we can retry. However, there is an issue here because
the counter is incremented before it is checked against the backend
limit. So, we can miss a connection retry.
Of course, we must invert both operations. The conn_retries counter must be
incremented after the check against the backend limit.
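A hedged sketch of the corrected ordering (structure and field names are
illustrative):

    /* check the backend limit first, and only then account for the retry */
    if (s->conn_retries >= be->conn_retries)
        return 0;        /* no more retries allowed */
    s->conn_retries++;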
This patch must be backported as far as 2.6.
This patch completes the previous one with poller subscribe of quic-conn
owned socket on sendto() error. This ensures that mux-quic is notified
if waiting on sending when a transient sendto() error is cleared. As
such, qc_notify_send() is called directly inside socket I/O callback.
qc_notify_send() internal conditions have thus been completed. This will
prevent notifying the upper layer until all sending conditions are fulfilled:
room in the congestion window and no transient error on the socket FD.
This should be backported up to 2.7.
On a sendto() transient error, prior to this patch, sending was simulated
and we relied on retransmission to retry sending. This could significantly
hurt the performance.
Thanks to quic-conn owned socket support, it is now possible to improve
this. On transient error, sending is interrupted and quic-conn socket FD
is subscribed on the poller for sending. When send is possible,
quic_conn_sock_fd_iocb() will be in charge of restarting the sending.
A consequence of this change is on the return value of qc_send_ppkts().
This function will now return 0 on transient error if quic-conn has its
owned socket. This is used to interrupt sending in the calling function.
The flag QUIC_FL_CONN_TO_KILL must be checked to differentiate a fatal
error from a transient one.
This should be backported up to 2.7.
Sending is implemented in two parts on quic-conn module. First, QUIC
packets are prepared in a buffer and then sendto() is called with this
buffer as input.
qc.tx.buf is used as the input buffer. It must always be empty before
starting to prepare new packets in it. Currently, this is guaranteed by
the fact that either sendto() completed, a fatal error was encountered
which prevents future sends, or a transient error was encountered and we
rely on retransmission to send the remaining data.
This will change when poller subscription of the socket FD on sendto()
transient error is implemented. In this case, qc.tx.buf may not be empty
when sending resumes after the transient error is cleared. To allow
the current sending process to work as expected, a new function
qc_purge_txbuf() is implemented. It will try to send remaining data
before preparing new packets for sending. If successful, txbuf will be
emptied and sending can continue. If not, sending will be interrupted.
This should be backported up to 2.7.
Implement qc_notify_send(). This function is responsible for notifying the
upper layer subscribed on SUB_RETRY_SEND if sending conditions are back
to normal.
For the moment, this patch has no functional change as only congestion
window room is checked before notifying the upper layer. However, this
will be extended when poller subscription of the socket on sendto() error is
implemented. qc_notify_send() will thus be responsible for ensuring that
all conditions are met before waking up the upper layer.
This should be backported up to 2.7.
This patch simply cleans up the return paths used in the various send functions
of the quic-conn module. This will simplify the implementation of poller
subscription on sendto() error, which adds another error handling path.
This should be backported up to 2.7.
The Content-Length header is always added into the request for an HTTP
health-check. However, when there is no payload, this header may be skipped
for OPTIONS, GET, HEAD and DELETE methods. In fact, it is a "SHOULD NOT" in
RFC 9110 (#8.6).
It is not really an issue in itself but it seems to be an issue for AWS
ELB. It returns a 400-Bad-Request if a HEAD/GET request with no payload
contains a Content-Length header.
So, it is better to skip this header when possible.
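A hedged sketch of the intended behavior (the method constants exist in
haproxy's HTTP API, the rest is illustrative):

    /* skip the Content-Length header for bodyless requests using methods
     * for which it SHOULD NOT be sent
     */
    int skip_cl = !body_len && (meth == HTTP_METH_OPTIONS ||
                                meth == HTTP_METH_GET ||
                                meth == HTTP_METH_HEAD ||
                                meth == HTTP_METH_DELETE);

    if (!skip_cl)
        add_content_length(htx, body_len);  /* illustrative helper */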
This patch should fix the issue #2026. It could be backported as far as 2.2.
When the HTTP request of a health-check is forged, we must not pretend there
is no payload, by setting HTX_SL_F_BODYLESS, if a log-format body was
configured.
Indeed, a test on the body length was used but it is only valid for a plain
string. For a log-format string, a list is used. Note it is a bug with no
consequence for now.
This patch must be backported as far as 2.2.
If the response is closed before any data was received, we must not report
an error to the SE descriptor. It is important to be able to retry on an
empty response.
This patch should fix the issue #2061. It must be backported to 2.7.
When a connection is removed from the safe list or the idle list,
CO_FL_SAFE_LIST and CO_FL_IDLE_LIST flags must be cleared. It is performed
when the connection is reused. But not when it is moved into the
toremove_conns list. It may be an issue because the multiplexer owning the
connection may be woken up before the connection is really removed. If the
connection flags are not sanitized, it may think the connection is idle and
reinsert it in the corresponding list. From this point, we can imagine
several bugs. An UAF or a connection reused with an invalid state for
instance.
To avoid any issue, the connection flags are sanitized when an idle
connection is moved into the toremove_conns list. The same is performed at
right places in the multiplexers. Especially because the connection release
may be delayed (for h2 and fcgi connections).
This patch should fix the issue #2057. It must carefully be backported as
far as 2.2. Especially on the 2.2 where the code is really different. But
some conflicts should be expected on the 2.4 too.
EBADF on sendto() is considered as a fatal error. As such, it is removed
from the list of the transient errors. The connection will be killed
when encountered.
For the record, EBADF can be encountered on process termination with the
listener socket.
This should be backported up to 2.7.
Send is conducted through qc_send_ppkts() for a QUIC connection. There
are two types of errors which can be encountered on sendto() or affiliated
syscalls :
* transient error. In this case, sending is simulated with the remaining
data and retransmission process is used to have the opportunity to
retry emission
* fatal error. If this happens, the connection should be closed as soon
as possible. This is done via qc_kill_conn() function. Until this
patch, only ECONNREFUSED errno was considered as fatal.
Modify the QUIC send API to be able to differentiate transient and fatal
errors more easily. This is done by fixing the return value of the
sendto() wrapper qc_snd_buf() :
* on fatal error, a negative error code is returned. This is now the
case for every errno except EAGAIN, EWOULDBLOCK, ENOTCONN, EINPROGRESS
and EBADF.
* on a transient error, 0 is returned. This is the case for the listed
errno values above and also if a partial send has been conducted by
the kernel.
* on success, the return value of sendto() syscall is returned.
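A hedged sketch of this return-value convention (simplified; not the actual
qc_snd_buf() body):

    ssize_t ret = sendto(fd, buf, len, 0, addr, addrlen);

    if (ret < 0) {
        if (errno == EAGAIN || errno == EWOULDBLOCK ||
            errno == ENOTCONN || errno == EINPROGRESS || errno == EBADF)
            return 0;    /* transient: the caller may subscribe and retry */
        return -1;       /* fatal: the connection must be killed */
    }
    if ((size_t)ret < len)
        return 0;        /* partial send is also reported as transient */
    return ret;          /* success */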
This commit will be useful to be able to handle transient error with a
quic-conn owned socket. In this case, the socket should be subscribed to
the poller and no simulated send will be conducted.
This commit allows errno management to be confined in the quic-sock
module which is a nice cleanup.
On a final note, EBADF should be considered as fatal. This will be the
subject of a next commit.
This should be backported up to 2.7.
thread_set_first_group() and thread_set_first_tmask() were modified
and renamed to instead return the number and mask of the nth group.
Passing zero continues to return the first one, but it will be more
convenient to use this way when building shards.
The listeners have a thr_conn[] array indexed on the thread number that
is used during connection redispatching to know what threads are the least
loaded. Since we introduced thread groups, and based on the fact that a
listener may only belong to one group, there's no point storing counters
for all threads, we just need to store them for all threads in the group.
Doing so reduces the struct listener from 1500 to 632 bytes. This may be
backported to 2.7 to save a bit of resources.
There's currently a problem affecting thread groups. Stopping a listener
from a different group than the one that runs this listener will trigger
the BUG_ON() in fd_delete(). This typically happens by issuing "disable
frontend f" on the CLI for the following config since the CLI runs on
group 1:
global
nbthread 2
thread-groups 2
stats socket /tmp/sock1 level admin
frontend f
mode http
bind abns@frt-sock thread 2
This happens because abns sockets cannot be suspended so here this
requires a full stop.
A first approach would consist in isolating the caller during such rare
operations but it turns out that fd_delete() is not robust against even
such calling conditions, because it uses its own thread mask with an FD
that may be in a different group, and even though the threads would be
isolated and running_mask should be zero, we must not mix thread masks
from different groups like this.
A better solution consists in replacing the bug condition detection with
a self-protection. After all it's not trivial to figure all likely call
places, and forcing upper layers to protect the code is not clean if we
can do it at the bottom. Thus this is what is being done now. We detect
a thread group mismatch, and if so, we forcefully isolate ourselves and
entirely clean the socket. This has the merit of being much more robust
and easier to use (and harder to misuse). Given that such operations are
very rare (actually when they happen a crash follows), it's not a problem
to waste some time isolating the caller there.
This must be backported to 2.7, along with this previous patch:
BUG/MINOR: fd: used the update list from the fd's group instead of tgid
In _fd_delete_orphan() we try to remove the FD from its update list
which is supposed to be the current thread group's. However the function
might be called from another group during stopping or under isolation,
so FD is not queued in the current group's update list but in its own
group's list. Let's retrieve the group from the FD instead of using
tgid.
This should have no impact on existing code since there is no code path
calling fd_delete() under thread isolation for now, and other cases are
blocked in fd_delete().
This must be backported to 2.7.
As for the H1 and H2 stream, the QUIC stream now states it does not expect
data from the server as long as the request is unfinished. The aim is the
same. We must be sure to not trigger a read timeout on server side if the
client is still uploading data.
From the moment the end of the request is received and forwarded to upper
layer, the QUIC stream reports it expects to receive data from the opposite
endpoint. This re-enables read timeout on the server side.
As for the H1 stream, the H2 stream now states it does not expect data from
the server as long as the request is unfinished. The aim is the same. We
must be sure to not trigger a read timeout on server side if the client is
still uploading data.
From the moment the end of the request is received and forwarded to upper
layer, the H2 stream reports it expects to receive data from the opposite
endpoint. This re-enables read timeout on the server side.
On client side, as long as the request is unfinished, the H1 stream states
it does not expect data from the server. It does not mean the server must
not send its response but only it may wait to receive the whole request with
no risk to trigger a read timeout.
When the request is finished, the H1 stream reports it expects to receive
data from the opposite endpoint.
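A hedged sketch of the idea (the condition and descriptor names are
illustrative; se_expect_no_data()/se_expect_data() are the stream-endpoint
helpers assumed here):

    /* while the request is not fully received, tell the stream layer not to
     * expect data from the other side, so no server read timeout is armed
     */
    if (!request_fully_received)
        se_expect_no_data(h1s->sd);
    else
        se_expect_data(h1s->sd); /* request done: re-enable the read timeout */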
The purpose of this patch is to never report a server timeout on receive if
the client is still uploading data. This way, it is possible to have a
smaller server timeout than the client one.
Instead of reporting a blocked send if nothing is sent, we do it if some
output data remain blocked after a write attempt or after a call to the
applet's I/O handler. It is mandatory to properly handle write timeouts.
Indeed, if an endpoint is blocked for a while but only partially consumed
output data, no timeout is triggered. It is especially true for
connections. But the same may happen for applets, and there is no reason to
treat them differently.
Of course, if the endpoint decides to partially consume output data because
it must wait to move on for any reason, it should use the se/applet API
(se/applet_will_consume(), se/applet_wont_consume() and
se/applet_need_more_data()).
This bug was introduced during the channels timeouts refactoring. No
backport is needed.
When we exit from process_stream(), if the task is expired, we try to handle
the stream timeouts and we resync the stream-connectors. This avoids a
useless immediate wakeup. It is not really an issue, but it is a small
improvement in edge cases.
This will be mandatory to be able to handle stream's timeouts before exiting
process_stream(). So, to not duplicate code, all this stuff is moved in a
dedicated function.
At the end of process_stream(), a BUG_ON() was recently added to abort if we
leave the function with an expired task. However, it may happen if an event
prevents the timeout from being handled but nothing evolved. In this case, the
task expiration is not updated and we expect to catch the timeout on the
immediate task wakeup.
No backport needed.
A bug during H1 data parsing may lead to copy more data than the maximum
allowed. The bug is an overflow on this max threshold when it is lower than
the size of an htx_blk structure.
At first glance, it means it is possible to not respect the buffer's
reserve. So it may lead to rewrite errors but it may also block any progress
on the stream if the compression is enabled. In this case, the channel
buffer appears as full and the compression must wait for space to
proceed. Outside of any bug, it is only possible when there are outgoing
data to forward, so the compression filter just waits. Because of this bug,
there is nothing to forward. The buffer is just full of input data. Thus
nothing moves and the stream is infinitely blocked.
To fix the bug, we must be sure to be able to create an HTX block of 1 byte
without exceeding the maximum allowed.
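A hedged sketch of the kind of guard this implies (variable names are
illustrative, not the actual H1 parsing code):

    /* make sure there is room for a block descriptor plus at least 1 byte
     * of payload, otherwise the subtraction below would underflow
     */
    if (max <= sizeof(struct htx_blk))
        return 0;                      /* wait for more room */
    max -= sizeof(struct htx_blk);     /* room left for actual payload */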
This patch should fix the issue #2053. It must be backported as far as 2.5.
With 4d9888c ("CLEANUP: fd: get rid of the __GET_{NEXT,PREV} macros") some
"volatile" keywords were dropped at various assignment places in fd's code.
In fd_add_to_fd_list() and fd_rm_from_fd_list(), because of the absence of
the "volatile" keyword, the compiler was able to perform some code
optimizations that prevented the prev and next variables from being reloaded
between locking attempts (goto loop).
The result was that fd_add_to_fd_list() and fd_rm_from_fd_list() could enter in
infinite loops, preventing other threads from working further and ultimately
resulting in the watchdog being triggered as described in GH #2011.
To fix this, we made sure to re-audit 4d9888c in order to restore the required
memory barriers / compiler hints to prevent the compiler from mis-optimizing
the code around the fd's locks.
That is: using atomic loads to fetch the prev and next values, and restoring
the "volatile" cast for the cur_list.ent variable assignment in fd_rm_from_fd_list().
Big thanks to @xanaxalan for his help and patience and to @wtarreau for his
findings and explanations in regard to compiler's optimizations.
This must be backported in 2.7 with 4d9888c ("CLEANUP: fd: get rid of the
__GET_{NEXT,PREV} macros")
This patch adds support for the HAPROXY_BRANCH environment variable.
It can be useful if some resources are loaded from different
locations when migrating from one version to another.
Signed-off-by: Sébastien Gross <sgross@haproxy.com>
Since the previous patch, the ring's offset is not used anymore. The
haring utility remains backward-compatible since it can trust the
buffer element that's at the beginning of the map and which still
contains all the valid data.
The ring's offset currently contains a perpetually growing cursor which
is the number of bytes written from the start. It's used by readers to
know where to (re)start reading from. It was made absolute because both
the head and the tail can change during writes and we needed a fixed
position to know where the reader was attached. But this is complicated,
error-prone, and limits the ability to reduce the lock's coverage. In
fact what is needed is to know where the reader is currently waiting, if
at all. And this location is exactly where it stored its count, so the
absolute position in the buffer (the seek offset from the first storage
byte) does represent exactly this, as it doesn't move (we don't realign
the buffer), and is stable regardless of how head/tail changes with writes.
This patch modifies this so that the application code now uses this
representation instead. The most noticeable change is the initialization,
where we've kept ~0 as a marker to go to the end, and it's now set to
the tail offset instead of trying to resolve the current write offset
against the current ring's position.
The offset was also used at the end of the consuming loop, to detect
if a new write had happened between the lock being released and taken
again, so as to wake the consumer(s) up again. For this we used to
take a copy of the ring->ofs before unlocking and comparing with the
new value read in the next lock. Since it's not possible to write past
the current reader's location, there's no risk of complete rollover, so
it's sufficient to check if the tail has changed.
Note that the change also has an impact on the haring consumer which
needs to adapt as well. But that's good in fact because it will rely
on one less variable, and will use offsets relative to the buffer's
head, and the change remains backward-compatible.
If a ring is resized, we must not zero its head since the contents
are preserved in-situ. Till now it used to work because we only resize
during boot and we emit very little data (if at all) during boot. But this
can change in the future.
This can be backported to 2.2 though no older version should notice a
difference.
This issue arrived with this commit:
1dbeb35f8 MINOR: quic: Add new traces about by connection RX buffer handling
and revealed by the GH CI as follows:
src/quic_conn.c: In function ‘quic_rx_pkts_del’:
include/haproxy/trace.h:134:65: error: format ‘%zu’ expects argument of type ‘size_t’,
but argument 6 has type ‘uint64_t’ {aka ‘long long unsigned int’} [-Werror=format=]
_msg_len = snprintf(_msg, sizeof(_msg), (fmt), ##args);
Replace all %zu printf integer format by %llu.
Must be backported to 2.7 where the previous commit is supposed to be backported.
In haproxy startup, all init error paths after the protocol bind step
cautiously call protocol_unbind_all() before exiting except one that was
conditional. We're not making an exception to the rule and we now properly
call protocol_unbind_all() as well.
No backport needed as this patch is unnoticeable.
In sock_unix_bind_receiver(), uxst_bind_listener() and uxdg_bind_listener(),
properly dump ABNS socket names by leveraging the sa2str() function which does
the hard work for us.
UNIX sockets are reported as is (unchanged) while ABNS UNIX sockets
are prefixed with 'abns@' to match the syntax used in the config file.
(they were previously shown as empty strings because the leading
NULL-byte was not properly handled in this case)
This is only a minor debug improvement, however it could be useful to
backport it up to 2.4.
[for 2.4: you should replace "%s [%s]" by "%s for [%s]" for uxst and uxdg if
you want the patch to apply properly]
When a listener is suspended, we expect that it may not process any data for
the time it is suspended.
Yet for named UNIX socket, as the suspend operation is a no-op at the proto
level, recv events on the socket may still be processed by the polling loop.
This is quite disturbing as someone may rely on a paused proxy being harmless,
which is true for all protos except for named UNIX sockets.
To fix this behavior, we explicitly disable io recv events when suspending a
named UNIX socket listener (we call the disable() method on the listener).
The io recv events will automatically be restored when the listener is resumed
since the l->enable() method is called at the end of the resume() operation.
This could be backported up to 2.4 after a reasonable observation
period to make sure that this change doesn't cause unwanted side-effects.
In sock_unix_addrcmp(), named UNIX sockets paths are manually compared in
order to properly handle tempname paths (ending with ".XXXX.tmp") that result
from the 2-step bind implemented in sock_unix_bind_receiver().
However, this logic does not take into account "final" path names (without the
".XXXX.tmp" suffix).
Example:
/tmp/test did not match with /tmp/test.1288.tmp prior to this patch
Indeed, depending on how the socket addr is retrieved, the same socket
could be designated either by its tempname or finalname.
socket addr is normally stored with its finalname within a receiver, but a
call to getsockname() on the same socket will return the tempname that was
used for the bind() call (sock_get_old_sockets() depends on getsockname()).
This causes sock_find_compatible_fd() to malfunction with named UNIX
sockets (ie: haproxy -x CLI option).
To fix this, we slightly modify the check around the temp suffix in
sock_unix_addrcmp(): we perform the suffix check even if one of the
paths is lacking the temp suffix (with proper precautions).
Now the function is able to match:
- finalname x finalname
- tempname x tempname
- finalname x tempname
That is: /tmp/test == /tmp/test.1288.tmp == /tmp/test.X.tmp
It should be backported up to 2.4
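To illustrate the matching rule, here is a hedged, standalone sketch of the
finalname/tempname comparison (hypothetical helpers, not the actual
sock_unix_addrcmp() code): each path is reduced to its base length, stripping
a trailing ".<digits>.tmp" suffix when present, and the bases are compared.

  #include <ctype.h>
  #include <stdio.h>
  #include <string.h>

  /* Return the length of <path> without a trailing ".XXXX.tmp" suffix,
   * or its full length if no such suffix is present.
   */
  static size_t base_len(const char *path)
  {
      size_t len = strlen(path);
      size_t i;

      if (len < 5 || strcmp(path + len - 4, ".tmp") != 0)
          return len;

      i = len - 4;            /* position of the '.' before "tmp" */
      while (i > 0 && isdigit((unsigned char)path[i - 1]))
          i--;
      if (i == 0 || path[i - 1] != '.' || i == len - 4)
          return len;          /* no ".<digits>." part: not a temp name */
      return i - 1;            /* length up to the '.' before the digits */
  }

  /* Match finalname x finalname, tempname x tempname and finalname x tempname */
  static int unix_path_match(const char *a, const char *b)
  {
      size_t la = base_len(a), lb = base_len(b);

      return la == lb && memcmp(a, b, la) == 0;
  }

  int main(void)
  {
      printf("%d\n", unix_path_match("/tmp/test", "/tmp/test.1288.tmp")); /* 1 */
      printf("%d\n", unix_path_match("/tmp/test", "/tmp/testing"));       /* 0 */
      return 0;
  }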
In 58651b42f ("MEDIUM: listener/proxy: make the listeners notify about
proxy pause/resume") we introduced the logic for pause/resume notify using
li_ready for pause and li_paused for resume.
Unfortunately, relying on li_paused for resume doesn't work reliably if we
resume a listener which is only made of receivers that are completely stopped.
For example, this could happen with receivers that don't support the
LI_PAUSED state like ABNS sockets.
This is especially true since pause_listener() was renamed to
suspend_listener() to better reflect its actual behavior in
("MINOR: listener: pause_listener() becomes suspend_listener()").
To fix this, we now rely on the li_suspended state in resume_listener() to make
sure that suspend_listener() and resume_listener() notify messages are
consistent to each other:
"Proxy pause" is triggered when there are no more ready listeners.
"Proxy resume" is triggered when there are no more suspended listeners.
Also, we make use of the new PR_FL_PAUSED proxy flag to make sure we don't
report the same event twice.
This could be backported up to 2.4 after a reasonable observation
period to make sure that this change doesn't cause unwanted side-effects.
--
Backport notes:
This commit depends on:
- "MINOR: listener: pause_listener() becomes suspend_listener()"
-> 2.4 only, as "MINOR: proxy/listener: support for additional PAUSED state"
was not backported:
Replace this:
|+ if (px && !(px->flags & PR_FL_PAUSED) && !px->li_ready) {
| /* PROXY_LOCK is required */
| proxy_cond_pause(px);
| ha_warning("Paused %s %s.\n", proxy_cap_str(px->cap), px->id);
By this:
|+ if (px && !px->li_ready) {
| ha_warning("Paused %s %s.\n", proxy_cap_str(px->cap), px->id);
| send_log(px, LOG_WARNING, "Paused %s %s.\n", proxy_cap_str(px->cap), px->id);
| }
And this:
|+ if (px && (px->flags & PR_FL_PAUSED) && !px->li_suspended) {
| /* PROXY_LOCK is required */
| proxy_cond_resume(px);
| ha_warning("Resumed %s %s.\n", proxy_cap_str(px->cap), px->id);
By this:
|+ if (px && !px->li_suspended) {
| ha_warning("Resumed %s %s.\n", proxy_cap_str(px->cap), px->id);
| send_log(px, LOG_WARNING, "Resumed %s %s.\n", proxy_cap_str(px->cap), px->id);
| }
We are simply renaming pause_listener() to suspend_listener() to prevent
confusion around listener pausing.
A suspended listener can be in two different valid states:
- LI_PAUSED: the listener is effectively paused, it will unpause on
resume_listener()
- LI_ASSIGNED (not bound): the listener does not support the LI_PAUSED
state, so it was unbound to satisfy the suspend request, it will
correctly re-bind on resume_listener()
Besides that, we add the LI_F_SUSPENDED flag to mark suspended listeners in
suspend_listener() and unmark them in resume_listener().
We're also adding the li_suspend proxy variable to track the number of currently
suspended listeners:
That is, the number of listeners that were suspended through suspend_listener()
and that are either in LI_PAUSED or LI_ASSIGNED state.
Counter is increased on successful suspend in suspend_listener() and it is
decreased on successful resume in resume_listener()
--
Backport notes:
-> 2.4 only, as "MINOR: proxy/listener: support for additional PAUSED state"
was not backported:
Replace this:
| /* PROXY_LOCK is require
| proxy_cond_resume(px);
By this:
| ha_warning("Resumed %s %s.\n", proxy_cap_str(px->cap), px->id);
| send_log(px, LOG_WARNING, "Resumed %s %s.\n", proxy_cap_str(px->cap), px->id);
-> 2.6 and 2.7 only, as "MINOR: listener: make sure we don't pause/resume" was
custom patched:
Replace this:
|@@ -253,6 +253,7 @@ struct listener {
|
| /* listener flags (16 bits) */
| #define LI_F_FINALIZED 0x0001 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+#define LI_F_SUSPENDED 0x0002 /* listener has been suspended using suspend_listener(), it is either is LI_PAUSED or LI_ASSIGNED state */
|
| /* Descriptor for a "bind" keyword. The ->parse() function returns 0 in case of
| * success, or a combination of ERR_* flags if an error is encountered. The
By this:
|@@ -222,6 +222,7 @@ struct li_per_thread {
|
| #define LI_F_QUIC_LISTENER 0x00000001 /* listener uses proto quic */
| #define LI_F_FINALIZED 0x00000002 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+#define LI_F_SUSPENDED 0x00000004 /* listener has been suspended using suspend_listener(), it is either is LI_PAUSED or LI_ASSIGNED state */
|
| /* The listener will be directly referenced by the fdtab[] which holds its
| * socket. The listener provides the protocol-specific accept() function to
Since fc974887c ("MEDIUM: protocol: explicitly start the receiver before
the listener"), resume from LI_ASSIGNED state does not work anymore.
This is because the binding part has been divided into 2 distinct steps
since: first bind(), then listen().
This new logic was properly implemented in the startup sequence
through protocol_bind_all() but wasn't properly reflected in
the default_resume_listener() function.
Fixing default_resume_listener() to comply with the new logic.
This should help ABNS sockets to properly rebind in resume_listener()
after they have been stopped by pause_listener():
See Redmine:4475 for more context.
This commit depends on:
- "MINOR: listener: workaround for closing a tiny race between resume_listener() and stopping"
- "MINOR: listener: make sure we don't pause/resume bypassed listeners"
This could be backported up to 2.4 after a reasonable observation period to
make sure that this change doesn't cause unwanted side-effects.
In resume_listener(), proto->resume() errors were not properly handled:
the function kept flowing down as if no errors were detected.
Instead, we're performing an early return when such errors are detected to
prevent undefined behaviors.
This could be backported up to 2.4.
--
Backport notes:
This commit depends on:
- "MINOR: listener: make sure we don't pause/resume bypassed listeners"
-> 2.4 ... 2.7:
Replace this:
| if (l->bind_conf->maxconn && l->nbconn >= l->bind_conf->maxconn) {
| l->rx.proto->disable(l);
By this:
| if (l->maxconn && l->nbconn >= l->maxconn) {
| l->rx.proto->disable(l);
Legacy suspend() return value handling in pause_listener() has been altered
over time.
First with fb76bd5ca ("BUG/MEDIUM: listeners: correctly report pause() errors")
Then with e03204c8e ("MEDIUM: listeners: implement protocol level
->suspend/resume() calls")
We aim to restore original function behavior and comply with resume_listener()
function description.
This is required for resume_listener() and pause_listener() to work as a whole
Now, it is made explicit that pause_listener() may stop a listener if the
listener doesn't support the LI_PAUSED state (depending on the protocol
family, ie: ABNS sockets), in this case l->state will be set to LI_ASSIGNED
and this won't be considered as an error.
This could be backported up to 2.4 after a reasonable observation period
to make sure that this change doesn't cause unwanted side-effects.
--
Backport notes:
This commit depends on:
- "MINOR: listener: make sure we don't pause/resume bypassed listeners"
-> 2.4: manual change required because "MINOR: proxy/listener: support
for additional PAUSED state" was not backported: the contextual patch
lines don't match.
Replace this:
| if (px && !px->li_ready) {
| /* PROXY_LOCK is required */
By this:
| if (px && !px->li_ready) {
| ha_warning("Paused %s %s.\n", proxy_cap_str(px->cap), px->id);
Some listeners are kept in LI_ASSIGNED state but are not supposed to be
started since they were bypassed on initial startup (eg: in protocol_bind_all()
or in enable_listener()...)
Introduce the LI_F_FINALIZED flag: when the variable is non-zero
it means that the listener made it past the LI_LISTEN state (finalized)
at least once so we can safely pause / resume. This way we won't risk starting
a previously bypassed listener which never made it that far and thus was not
expected to be lazy-started by accident.
As listener_pause() and listener_resume() are currently partially broken, such
unexpected lazy-start won't happen. But we're trying to restore pause() and
resume() behavior so this patch will be required before going any further.
We had to re-introduce the listener's 'flags' struct member since it was recently
moved into the bind_conf struct. But here we do have a legitimate need for these
listener-only flags.
This should only be backported if explicitly required by another commit.
--
Backport notes:
-> 2.4 and 2.5:
The 2-byte hole we're using in the current patch does not apply, let's
use the 4-byte hole located under the 'option' field.
Replace this:
|@@ -226,7 +226,8 @@ struct li_per_thread {
| struct listener {
| enum obj_type obj_type; /* object type = OBJ_TYPE_LISTENER */
| enum li_state state; /* state: NEW, INIT, ASSIGNED, LISTEN, READY, FULL */
|- /* 2-byte hole here */
|+ uint16_t flags; /* listener flags: LI_F_* */
| int luid; /* listener universally unique ID, used for SNMP */
| int nbconn; /* current number of connections on this listener */
| unsigned int thr_idx; /* thread indexes for queue distribution : (t2<<16)+t1 */
By this:
|@@ -209,6 +209,8 @@ struct listener {
| short int nice; /* nice value to assign to the instantiated tasks */
| int luid; /* listener universally unique ID, used for SNMP */
| int options; /* socket options : LI_O_* */
|+ uint16_t flags; /* listener flags: LI_F_* */
|+ /* 2-bytes hole here */
| __decl_thread(HA_RWLOCK_T lock);
|
| struct fe_counters *counters; /* statistics counters */
-> 2.4 only:
We need to adjust some contextual lines.
Replace this:
|@@ -477,7 +478,7 @@ int pause_listener(struct listener *l, int lpx, int lli)
| if (!lli)
| HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
|- if (l->state <= LI_PAUSED)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state <= LI_PAUSED)
| goto end;
|
| if (l->rx.proto->suspend)
By this:
|@@ -477,7 +478,7 @@ int pause_listener(struct listener *l, int lpx, int lli)
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
| goto end;
|
|- if (l->state <= LI_PAUSED)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state <= LI_PAUSED)
| goto end;
|
| if (l->rx.proto->suspend)
And this:
|@@ -535,7 +536,7 @@ int resume_listener(struct listener *l, int lpx, int lli)
| if (MT_LIST_INLIST(&l->wait_queue))
| goto end;
|
|- if (l->state == LI_READY)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state == LI_READY)
| goto end;
|
| if (l->rx.proto->resume)
By this:
|@@ -535,7 +536,7 @@ int resume_listener(struct listener *l, int lpx, int lli)
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
| goto end;
|
|- if (l->state == LI_READY)
|+ if (!(l->flags & LI_F_FINALIZED) || l->state == LI_READY)
| goto end;
|
| if (l->rx.proto->resume)
-> 2.6 and 2.7 only:
struct listener 'flags' member still exists, let's use it.
Remove this from the current patch:
|@@ -226,7 +226,8 @@ struct li_per_thread {
| struct listener {
| enum obj_type obj_type; /* object type = OBJ_TYPE_LISTENER */
| enum li_state state; /* state: NEW, INIT, ASSIGNED, LISTEN, READY, FULL */
|- /* 2-byte hole here */
|+ uint16_t flags; /* listener flags: LI_F_* */
| int luid; /* listener universally unique ID, used for SNMP */
| int nbconn; /* current number of connections on this listener */
| unsigned int thr_idx; /* thread indexes for queue distribution : (t2<<16)+t1 */
Then, replace this:
|@@ -251,6 +250,9 @@ struct listener {
| EXTRA_COUNTERS(extra_counters);
| };
|
|+/* listener flags (16 bits) */
|+#define LI_F_FINALIZED 0x0001 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|+
| /* Descriptor for a "bind" keyword. The ->parse() function returns 0 in case of
| * success, or a combination of ERR_* flags if an error is encountered. The
| * function pointer can be NULL if not implemented. The function also has an
By this:
|@@ -221,6 +221,7 @@ struct li_per_thread {
| };
|
| #define LI_F_QUIC_LISTENER 0x00000001 /* listener uses proto quic */
|+#define LI_F_FINALIZED 0x00000002 /* listener made it to the READY||LIMITED||FULL state at least once, may be suspended/resumed safely */
|
| /* The listener will be directly referenced by the fdtab[] which holds its
| * socket. The listener provides the protocol-specific accept() function to
This is an alternative fix that tries to address the same issue as
d1ebee177 ("BUG/MINOR: listener: close tiny race between
resume_listener() and stopping") while allowing resume_listener() to be
more versatile.
Indeed, because of the previous fix, resume_listener() is not able to
rebind stopped listeners, and this breaks the original behavior that is
documented in the function description:
"If the listener was only in the assigned
state, it's totally rebound. This can happen if a pause() has completely
stopped it. If the resume fails, 0 is returned and an error might be
displayed."
With relax_listener(), we now make sure to check l->state under the
listener lock so we don't call resume_listener() when the conditions are not
met.
As such, concurrently stopped listeners may not be rebound using
relax_listener().
Note: the documented race can't happen since 1b927eb3c ("MEDIUM: proto: stop
protocols under thread isolation during soft stop"), but older versions are
concerned as 1b927eb3c was not marked for backports.
Moreover, the patch also prevents the race between protocol_pause_all() and
resuming from LIMITED or FULL states.
This commit depends on:
- "MINOR: listener: add relax_listener() function"
This should be backported with d1ebee177 up to 2.4
(d1ebee177 is marked to be backported for all stable versions but the current
patch does not apply for versions < 2.4)
There is a need for a small difference between resuming and relaxing
a listener.
When resuming, we expect that the listener may completely resume, this includes
unpausing or rebinding if required.
Resuming a listener is a best-effort operation: no matter the current state,
try our best to bring the listener up to the LI_READY state.
There are some cases where we only want to "relax" listeners that were
previously restricted using limit_listener() or listener_full() functions.
Here we don't want to resuscitate listeners, we're simply interested in
cancelling out the previous restriction.
To this day, listener_resume() on an unbound listener is broken, which is why
the need for this wasn't felt yet.
But we're trying to restore historical listener_resume() behavior, so we better
prepare for this by introducing an explicit relax_listener() function that
only does what is expected in such cases.
This commit depends on:
- "MINOR: listener/api: add lli hint to listener functions"
Add listener lock hint (AKA lli) to (stop/resume/pause)_listener() functions.
All these functions implicitly take the listener lock when they are called:
It could be useful to be able to call them while already holding the lock, so
we're adding lli hint to make them take the lock only when it is missing.
This should only be backported if explicitly required by another commit
--
-> 2.4 and 2.5 common backport notes:
These 2 commits need to be backported first:
- 187396e34 "CLEANUP: listener: function comment typo in stop_listener()"
- a57786e87 "BUG/MINOR: listener: null pointer dereference suspected by
coverity"
-> 2.4 special backport notes:
In addition to the previously mentioned dependencies, the patch needs to be
slightly adapted to match the corresponding contextual lines:
Replace this:
|@@ -471,7 +474,8 @@ int pause_listener(struct listener *l, int lpx)
| if (!lpx && px)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &px->lock);
|
|- HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|+ if (!lli)
|+ HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
| if (l->state <= LI_PAUSED)
| goto end;
By this:
|@@ -471,7 +474,8 @@ int pause_listener(struct listener *l, int lpx)
| if (!lpx && px)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &px->lock);
|
|- HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|+ if (!lli)
|+ HA_RWLOCK_WRLOCK(LISTENER_LOCK, &l->lock);
|
| if ((global.mode & (MODE_DAEMON | MODE_MWORKER)) &&
| !(proc_mask(l->rx.settings->bind_proc) & pid_bit))
Replace this:
|@@ -169,7 +169,7 @@ void protocol_stop_now(void)
| HA_SPIN_LOCK(PROTO_LOCK, &proto_lock);
| list_for_each_entry(proto, &protocols, list) {
| list_for_each_entry_safe(listener, lback, &proto->receivers, rx.proto_list)
|- stop_listener(listener, 0, 1);
|+ stop_listener(listener, 0, 1, 0);
| }
| HA_SPIN_UNLOCK(PROTO_LOCK, &proto_lock);
| }
By this:
|@@ -169,7 +169,7 @@ void protocol_stop_now(void)
| HA_SPIN_LOCK(PROTO_LOCK, &proto_lock);
| list_for_each_entry(proto, &protocols, list) {
| list_for_each_entry_safe(listener, lback, &proto->receivers, rx.proto_list)
| if (!listener->bind_conf->frontend->grace)
|- stop_listener(listener, 0, 1);
|+ stop_listener(listener, 0, 1, 0);
| }
| HA_SPIN_UNLOCK(PROTO_LOCK, &proto_lock);
Replace this:
|@@ -2315,7 +2315,7 @@ void stop_proxy(struct proxy *p)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &p->lock);
|
| list_for_each_entry(l, &p->conf.listeners, by_fe)
|- stop_listener(l, 1, 0);
|+ stop_listener(l, 1, 0, 0);
|
| if (!(p->flags & (PR_FL_DISABLED|PR_FL_STOPPED)) && !p->li_ready) {
| /* might be just a backend */
By this:
|@@ -2315,7 +2315,7 @@ void stop_proxy(struct proxy *p)
| HA_RWLOCK_WRLOCK(PROXY_LOCK, &p->lock);
|
| list_for_each_entry(l, &p->conf.listeners, by_fe)
|- stop_listener(l, 1, 0);
|+ stop_listener(l, 1, 0, 0);
|
| if (!p->disabled && !p->li_ready) {
| /* might be just a backend */
The resume method was not explicitly defined for the uxst protocol family.
Here we can safely use default_resume_listener(), just like for the uxdg family.
This could be backported up to 2.4.
In protocol_bind_all() (involved in startup sequence):
We only free errmsg (set by fam->bind() attempt) when we make use of it.
But this could lead to some memory leaks because there are some cases
where we ignore the error message (e.g: verbose=0 with ERR_WARN messages).
As long as errmsg is set, we should always free it.
As mentioned earlier, this really is a minor leak because it can only occur under
specific conditions (error paths) during the startup phase.
This may be backported up to 2.4.
--
Backport notes:
-> 2.4 only:
Replace this:
| ha_warning("Binding [%s:%d] for %s %s: %s\n",
| listener->bind_conf->file, listener->bind_conf->line,
| proxy_type_str(px), px->id, errmsg);
By this:
| else if (lerr & ERR_WARN)
| ha_warning("Starting %s %s: %s\n",
| proxy_type_str(px), px->id, errmsg);
In uxst_bind_listener() and uxdg_bind_listener(), when the function
fails because the listener is not bound, both functions set
the error message but don't set the err status before returning.
Because of this, such an error is not properly handled by the upper functions.
Make sure this error is properly caught by returning a composition of
ERR_FATAL and ERR_ALERT.
This could be backported up to 2.4.
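A hedged sketch of the error-path convention described above (ERR_FATAL and
ERR_ALERT are the real flag names, but the values and the function below are
purely illustrative, not the actual bind listener code):

  #define ERR_NONE  0x00
  #define ERR_ALERT 0x10  /* illustrative values, not HAProxy's */
  #define ERR_FATAL 0x80

  /* Sketch of the fixed error path: setting the message alone is not
   * enough, the status must carry ERR_FATAL | ERR_ALERT so that callers
   * treat the unbound listener as a hard failure.
   */
  static int bind_listener_sketch(int bound, const char **errmsg)
  {
      int err = ERR_NONE;

      if (!bound) {
          *errmsg = "listener is not bound";
          err |= ERR_FATAL | ERR_ALERT; /* previously missing: the error was silently ignored */
      }
      return err;
  }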
The half-closed timeout is now directly retrieved from the proxy
settings. There is no longer any usage for the .hcto field in the stconn
structure. So let's remove it.
We now directly use the proxy settings to set the half-close timeout of a
stream-connector. The function sc_set_hcto() must be used to do so. This
timeout is only set when a shutw is performed. So it is not really a big
deal to use a dedicated function to do so.
It was done by hand by callers when a shutdown for read or write was
performed. It is now always handled by the functions performing the
shutdown. This way the callers don't take care of it. This will avoid some
bugs.
Expiration dates in trace messages are now relative to now_ms. Traces are
fairly easier to read this way. And an expired date is now negative, so it
is also easy to detect when a timeout was reached.
Because read and write timeouts are now detected using the .lra and .fsb fields
of the stream-endpoint descriptor, it is better to also use these fields to
report read and write expiration dates in trace messages, especially because
the old rex and wex fields will be removed.
We stop using the channel's expiration dates to detect read and write
timeouts on the channels. We now rely on the stream-endpoint descriptor to
do so. All of this is handled in process_stream().
The stream relies on 2 helper functions to know if the receives or sends may
expire: sc_rcv_may_expire() and sc_snd_may_expire().
An endpoint should now set SE_FL_EXP_NO_DATA flag if it does not expect any
data from the opposite endpoint. This way, the stream will be able to
disable any read timeout on the opposite endpoint. Applets should use
applet_expect_no_data() and applet_expect_data() functions to set or clear
the flag. For now, only dns and sink forwarder applets are concerned.
The stream endpoint descriptor now owns two dates, lra (last read activity) and
fsb (first send blocked).
The first one is updated every time a read activity is reported, including data
received from the endpoint, successful connect, end of input and shutdown for
reads. A read activity is also reported when receives are unblocked. It will be
used to detect read timeouts.
The other one is updated when no data can be sent to the endpoint and reset
when some data are sent. It is the date of the first send blocked by the
endpoint. It will be used to detect write timeouts.
Helper functions are added to report read/send activity and to retrieve lra/fsb
date.
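A minimal sketch of these two dates and how they are meant to be updated,
with generic helper names (not the actual stconn/sedesc API):

  #include <stdint.h>

  struct sedesc_dates {
      uint32_t lra; /* last read activity: data received, connect, EOI, shutr... */
      uint32_t fsb; /* first send blocked: date of the first blocked send, 0 if none */
  };

  static void report_read_activity(struct sedesc_dates *d, uint32_t now_ms)
  {
      d->lra = now_ms; /* refreshed on any read activity, used for the read timeout */
  }

  static void report_send_result(struct sedesc_dates *d, int blocked, uint32_t now_ms)
  {
      if (!blocked)
          d->fsb = 0;          /* some data were sent: reset the blocked date */
      else if (!d->fsb)
          d->fsb = now_ms;     /* first blocked send: start the write timeout clock */
  }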
Read and write timeouts (.rto and .wto) are now replaced by a unique
timeout, called .ioto. Since the recent refactoring on channel's timeouts,
both use the same value, the client timeout on client side and the server
timeout on the server side. Thus, this part may be simplified. Now it
represents the I/O timeout.
After I/O handling, in sc_notify(), the stream's task is no longer
requeued. The stream may be woken up, but its task is not requeued. It is
useless nowadays and only avoids a call to process_stream() in edge
cases. It is not really a big deal if the stream is woken up for nothing
because its task expired. At worst, it will be responsible for computing its
new expiration date.
These timers are related to the I/O. Thus it is logical to move them into
the SE descriptor. The patch is a bit huge but it is just a
replacement. However it is error-prone.
From the stconn or the stream, helper functions are used to get, set or
reset these timers. This simplifies timer manipulations.
Read and write timeouts concern the I/O. Thus, it is logical to move them into
the stconn. In the end, the stream is responsible for detecting the timeouts. So
it is logical to have these values in the stconn and not in the SE
descriptor. But this may change depending on the refactoring.
So, now:
* scf->rto is used instead of req->rto
* scf->wto is used instead of res->wto
* scb->rto is used instead of res->rto
* scb->wto is used instead of req->wto
This patch removes CF_READ_ERROR and CF_WRITE_ERROR flags. We now rely on
SE_FL_ERR_PENDING and SE_FL_ERROR flags. SE_FL_ERR_PENDING is used for write
errors and SE_FL_ERROR for read or unrecoverable errors.
When a connection error is reported, SE_FL_ERROR and SE_FL_EOS are now set and a
read event and a write event are reported to be sure the stream will properly
process the error. At the stream-connector level, it is similar. When an error
is reported during a send, a write event is triggered. On the read side, nothing
more is performed because an error at this stage is enough to wake the stream
up.
A major change is brought with this patch. We stop checking the flags of the
opposite channel to report an abort or a timeout. It also means that when a read
or write error is reported on one side, we no longer update the other side. Thus
a read error on the server side no longer leads to a write error on the
client side. This should ease error reporting.
This flag was introduced in 1.3 to fix a design issue. It has been untouched
since then but there is no reason to still have this trick. Note it could be
good to review what happens in HTTP when the server is waiting for the end of
the request, to be sure a client timeout is always reported.
In bb581423b ("BUG/MEDIUM: httpclient/lua: crash when the lua task timeout
before the httpclient"), a new logic was implemented to make sure that
when a lua ctx is destroyed, related httpclients are correctly destroyed too,
to prevent such httpclients from being resuscitated on a destroyed lua ctx.
This was implemented by adding a list of httpclients within the lua ctx,
and a new function, hlua_httpclient_destroy_all(), that is called under
hlua_ctx_destroy() and runs through the httpclients list in the lua context
to properly terminate them.
This was done with the assumption that no concurrent Lua garbage collection
cycles could occur on the same resources, which seems OK since the "lua"
context is about to be freed and is not explicitly being used by other threads.
But when 'lua-load' is used, the main lua stack is shared between multiple
OS threads, which means that all lua ctx in the process are linked to the
same parent stack.
Yet it seems that lua GC, which can be triggered automatically under
lua_resume() or manually through lua_gc(), does not limit itself to the
"coroutine" stack (the stack referenced in lua->T) when performing the cleanup,
but is able to perform some cleanup on the main stack plus coroutines stacks
that were created under the same main stack (via lua_newthread()) as well.
This can be explained by the fact that lua_newthread() coroutines are not meant
to be thread-safe by design.
Source: http://lua-users.org/lists/lua-l/2011-07/msg00072.html (lua co-author)
It did not cause other issues so far because most of the time when using
'lua-load', the global lua lock is taken when performing critical operations
that are known to interfere with the main stack.
But here in hlua_httpclient_destroy_all(), we don't run under the global lock.
Now that we properly understand the issue, the fix is pretty trivial:
We could simply guard the hlua_httpclient_destroy_all() under the global
lua lock, this would work but it could increase the contention over the
global lock.
Instead, we switched 'lua->hc_list' which was introduced with bb581423b
from simple list to mt_list so that concurrent accesses between
hlua_httpclient_destroy_all and hlua_httpclient_gc() are properly handled.
The issue was reported by @Mark11122 on Github #2037.
This must be backported with bb581423b ("BUG/MEDIUM: httpclient/lua: crash
when the lua task timeout before the httpclient") as far as 2.5.
In hlua_httpclient_send(), we replace hc->req.url with a new url.
But we forgot to free the original url that was allocated in
hlua_httpclient_new() or in the previous httpclient_send() call.
Because of this, each httpclient request performed under lua scripts would
result in a small leak. When stress-testing a lua action which uses httpclient,
the leak is clearly visible since we're leaking several Mbytes per minute.
This bug was discovered by chance when trying to reproduce GH issue #2037.
It must be backported up to 2.5
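The nature of the leak can be shown with a generic ownership sketch (plain
strdup-based strings, hypothetical struct, not the actual httpclient ist API):

  #include <stdlib.h>
  #include <string.h>

  struct req {
      char *url; /* heap-allocated, owned by the request */
  };

  /* Buggy pattern: the previous url is overwritten and leaked on every call. */
  static int set_url_leaky(struct req *r, const char *url)
  {
      r->url = strdup(url);
      return r->url != NULL;
  }

  /* Fixed pattern: release the previous allocation before replacing it. */
  static int set_url_fixed(struct req *r, const char *url)
  {
      char *copy = strdup(url);

      if (!copy)
          return 0;
      free(r->url);
      r->url = copy;
      return 1;
  }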
Before looking for a secondary cache entry for a given request we
checked that the first entry was complete, which might prevent us from
using a valid entry if the first one with the same primary key is not
full yet.
Likewise, if the primary entry is complete but not the secondary entry
we try to use, we might end up using a partial entry from the cache as
a response.
This bug was raised in GitHub #2048.
It can be backported up to branch 2.4.
Since commit cc9bf2e5f "MEDIUM: cache: Change caching conditions"
responses that do not have an explicit expiration time are not cached
anymore. But this mechanism wrongly used the TX_CACHE_IGNORE flag
instead of the TX_CACHEABLE one. The effect this had is that a cacheable
response that corresponded to a request having a "Cache-Control:
no-cache" for instance would not be cached.
Contrary to what was said in the other commit message, the "checkcache"
option should not be impacted by the use of the TX_CACHEABLE flag
instead of the TX_CACHE_IGNORE one. The response is indeed considered as
not cacheable if it has no expiration time, regardless of the presence
of a cookie in the response.
This should fix GitHub issue #2048.
This patch can be backported up to branch 2.4.
HAPROXY_STARTUP_VERSION: contains the version used to start, in
master-worker mode this is the version which was used to start the
master, even after updating the binary and reloading.
This patch could be backported in every version since it is useful when
debugging.
This patch handles the case where the fd could be -1 when proc_self was
lost for some reason (environment variable corrupted or upgrade from < 1.9).
This could result in an out-of-bounds array access fdtab[-1] and would crash.
Must be backported in every maintained versions.
Previous versions ( < 1.9 ) of the master-worker process didn't have the
"HAPROXY_PROCESSES" environment variable which contains the list of
processes, fd etc.
The part which describes the master is created at first startup so if
you started the master with an old version you would never have
it.
Since patch 68836740 ("MINOR: mworker: implement a reload failure
counter"), the failedreloads member of the proc_self structure for the
master is set to 0. However if this structure does not exist, it will
result in a NULL dereference and crash the master.
This patch fixes the issue by creating the proc_self structure for the
master when it does not exist. It also shows a warning which states to
restart the master if that is the case, because we can't guarantee that
it will be working correctly.
This MUST be backported as far as 2.5, and could be backported in every
other stable branches.
When parsing the HAPROXY_PROCESSES environment variable, strtok was
done directly from the ptr resulting from getenv(), which replaces the ;
by \0, showing confusing environment variables when debugging in /proc
or in a corefile.
Example:
(gdb) x/39s *environ
[...]
0x7fff6935af64: "HAPROXY_PROCESSES=|type=w"
0x7fff6935af7e: "fd=3"
0x7fff6935af83: "pid=4444"
0x7fff6935af8d: "rpid=1"
0x7fff6935af94: "reloads=0"
0x7fff6935af9e: "timestamp=1676338060"
0x7fff6935afb3: "id="
0x7fff6935afb7: "version=2.4.0-8076da-1010+11"
This patch fixes the issue by doing a strdup on the variable.
Could be backported in previous versions (mworker_proc_to_env_list
exists since 1.9)
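The corrected pattern follows the classic one below (a simplified, hypothetical
parser, not the real mworker_proc_to_env_list() code): tokenize a private copy
so the process environment stays intact in /proc and in core files. The ';'
separator is taken from the description above and only illustrative.

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  int main(void)
  {
      const char *env = getenv("HAPROXY_PROCESSES");
      char *copy, *tok;

      if (!env)
          return 0;

      /* strtok() writes '\0' into its argument; working on a copy keeps
       * the original environment string (and /proc/<pid>/environ) intact.
       */
      copy = strdup(env);
      if (!copy)
          return 1;

      for (tok = strtok(copy, ";"); tok; tok = strtok(NULL, ";"))
          printf("process entry: %s\n", tok);

      free(copy);
      return 0;
  }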
Implement a way to test if some options are enabled at run-time. For now,
the following options may be detected:
POLL, EPOLL, KQUEUE, EVPORTS, SPLICE, GETADDRINFO, REUSEPORT,
FAST-FORWARD, SERVER-SSL-VERIFY-NONE
These options are those that can be disabled on the command line. This way
it is possible, from a reg-test for instance, to know if a feature is
supported or not:
feature cmd "$HAPROXY_PROGRAM -cc '!(global.tune & GTUNE_NO_FAST_FWD)'"
The option was renamed to only permit to disable the fast-forward. First
there is no reason to enable it because it is the default behavior. Then it
introduced a bug because there is no way to be sure the command line has
precedence over the configuration this way. So, the option is now named
"tune.disable-fast-forward" and does not support any argument. And of
course, the command line option "-dF" now has precedence over the
configuration.
No backport needed.
For server connections, both the frontend and backend were considered to
enable the httpclose option. However, it is ambiguous because on client side
only the frontend is considered. In addition, for 2 frontends, one with the
option enabled and the other without, the HTTP connection mode may differ
while it is a backend setting.
Thus, now, for the server side, only the backend is considered. Of course,
if the option is set for a listener, the option will be enabled if the
listener is the backend's connection.
We must never exit from the stream processing function with an expired
task. Otherwise, we are pretty sure this will end with a spinning loop. It
is really better to abort as early as possible and with the original buggy
state. This will ease the debug sessions.
If the TX buffer (->tx.buf) attached to the connection is not drained, there
are chances that this will be detected by qc_txb_release() which triggers
a BUG_ON_HOT() when this is the case as follows
[00|quic|2|c_conn.c:3477] UDP port unreachable : qc@0x5584f18d6d50 pto_count=0 cwnd=6816 ppif=1046 pif=1046
[00|quic|5|ic_conn.c:749] qc_kill_conn(): entering : qc@0x5584f18d6d50
[00|quic|5|ic_conn.c:752] qc_kill_conn(): leaving : qc@0x5584f18d6d50
[00|quic|5|c_conn.c:3532] qc_send_ppkts(): leaving : qc@0x5584f18d6d50 pto_count=0 cwnd=6816 ppif=1046 pif=1046
FATAL: bug condition "buf && b_data(buf)" matched at src/quic_conn.c:3098
Consume the remaining data in the TX buffer by calling b_del().
This bug arrived with this commit:
a2c62c314 MINOR: quic: Kill the connections on ICMP (port unreachable) packet receipt
Also take the opportunity of this patch to modify the comments for qc_send_ppkts()
which should have arrived with the a2c62c314 commit.
Must be backported to 2.7 where this latter commit is supposed to be backported.
Traces from this function would miss a TRACE_LEAVE() on the success path,
which had for consequences 1) that it was difficult to figure out where the
function was left, and 2) that we never had the allocated stream ID
clearly visible (actually the one returned by h2c_frt_stream_new() is
the right one but it's not obvious).
This can be backported to 2.7 and 2.6.
Functions which are called with dummy streams pass them down to the traces
and that leads to a somewhat confusing "h2s=0x1234568(0,IDL)" for example
while the nature of the called function makes this stream useless at that
place. Better not report a random pointer, especially since it always
requires to look at the code before remembering how this should be
interpreted.
Now what we're doing is that the idle stream only prints "h2s=IDL" which
is shorter and doesn't report a pointer, closed streams do not report
anything since the stream ID 0 already implies it, and other ones are
reported normally.
This could be backported to 2.7 and 2.6 as it improves traces legibility.
With the previous commit, quic-conns are now handled as jobs to prevent the
termination of the haproxy process. This ensures that QUIC connections are
closed when all data are acknowledged by the client and there are no more
active streams.
The quic-conn layer emits a CONNECTION_CLOSE once the MUX has been
released and all streams are acknowledged. Then, the timer is scheduled
to definitely free the connection after the idle timeout period. This
allows to treat late-arriving packets.
Adjust this procedure to deactivate this timer when process stopping is
in progress. In this case, quic-conn timer is set to expire immediately
to free the quic-conn instance as soon as possible. This allows to
quickly close haproxy process.
This should be backported up to 2.7.
To prevent data loss for QUIC connections, the haproxy global variable jobs
is incremented each time a quic-conn socket is allocated. This allows
the QUIC connection to terminate all its transfer operations during proxy
soft-stop. Without this patch, the process will be terminated without
waiting for QUIC connections.
Note that this is done in qc_alloc_fd(). This means only QUIC connection
with their owned socket will properly support soft-stop. In the other
case, the connection will be interrupted abruptly as before. Similarly,
jobs decrement is conducted in qc_release_fd().
This should be backported up to 2.7.
Properly implement support for haproxy soft-stop on QUIC MUX. This code
is similar to the H2 MUX:
* on timeout refresh, if soft-stop is in progress, schedule the timeout to
expire with regard to the close-spread-end window.
* after input/output processing, if soft-stop is in progress, shut down the
connection. This is randomly spread over the close-spread-end window. In the
case of H3 connection, a GOAWAY is emitted and the connection is kept
until all data are sent for opened streams. If the client tries to use
new streams, they are rejected in conformance with the GOAWAY
specification.
This ensures that MUX is able to forward all content properly before
closing the connection. The lower quic-conn layer is then responsible
for retransmission and should be closed when all data are acknowledged.
This will be implemented in the next commit to fully support soft-stop
for QUIC connections.
This should be backported up to 2.7.
Implement client-fin timeout for MUX quic. This timeout is used once an
applicative layer shutdown has been called. In HTTP/3, this corresponds
to the emission of a GOAWAY.
This should be backported up to 2.7.
Define a new function qc_process(). This function will regroup several
internal operations which should be called both on I/O tasklet and wake()
callback. For the moment, only streams purge is conducted there.
This patch is useful to support haproxy soft stop. This should be
backported up to 2.7.
Factorize shutdown operation in a dedicated function qc_shutdown(). This
will allow to call it from multiple places. A new flag QC_CF_APP_SHUT is
also defined to ensure it will only be executed once even if called
multiple times per connection.
This commit will be useful to properly support haproxy soft stop.
This should be backported up to 2.7.
When a GOAWAY has been emitted, an ID is announced to represent handled
streams. H3 RFC suggests that higher streams should be reset with the
error code H3_REQUEST_CANCELLED. This allows the peer to replay requests
on another connection.
For the moment, the impact of this change is limited as GOAWAY is only
used on connection shutdown just before the MUX is freed. However, for
soft-stop support, a GOAWAY can be emitted in anticipation while keeping
the MUX to finish the active streams. In this case, new streams opened
by the client are reset.
As a consequence of this change, the app_ops.attach() operation has been
delayed to the very end of qcs_new(). This ensures that all qcs members
are initialized to support RESET_STREAM sending.
This should be backported up to 2.7.
h3s stores the current demux frame type and length as a state info. It
should be big enough to store a QUIC variable-length integer which is
the maximum H3 frame type and size.
Without this patch, there is a risk of integer overflow if H3 frame size
is bigger than INT_MAX. This can typically cause demux state mismatches
and demux frame errors. However, no occurrence of this bug has been found yet
with the current implementation.
This should be backported up to 2.6.
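The sizing constraint can be illustrated as follows (hedged sketch, not the
actual h3s structure): a QUIC variable-length integer goes up to 2^62 - 1, so
the demux frame type and length must be stored in 64-bit fields; a plain int
would silently truncate or overflow for large frames.

  #include <stdint.h>

  #define QUIC_VARINT_MAX ((1ULL << 62) - 1)

  struct demux_state_sketch {
      uint64_t ftype; /* current frame type, H3_FT_UNINIT-like marker when idle */
      uint64_t flen;  /* remaining frame length; must hold a full QUIC varint */
  };

  /* Returns 0 if the announced length cannot be a valid varint value. */
  static int set_frame(struct demux_state_sketch *st, uint64_t type, uint64_t len)
  {
      if (len > QUIC_VARINT_MAX)
          return 0;
      st->ftype = type;
      st->flen = len;   /* an "int" here would overflow for lengths above INT_MAX */
      return 1;
  }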
When the MUX is freed, the quic-conn layer may stay active until all
streams acknowledgment are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible to handle
it. This is done by the emission of a STOP_SENDING + RESET_STREAM.
Prior to this patch, the received packet was not acknowledged. This is
undesirable if the quic-conn is able to properly reject the request as
this can lead to unneeded retransmission from the client.
This must be backported up to 2.6.
When the MUX is freed, the quic-conn layer may stay active until all
streams acknowledgment are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible to handle
it. This is done by the emission of a STOP_SENDING.
This process has been completed to also emit a RESET_STREAM with the
same error code H3_REQUEST_REJECTED. This is done to conform with the H3
specification to invite the client to retry its request on a new
connection.
This should be backported up to 2.6.
When the MUX is freed, the quic-conn layer may stay active until all
streams acknowledgment are processed. In this interval, if a new stream
is opened by the client, the quic-conn is thus now responsible to handle
it. This is done by the emission of a STOP_SENDING.
This process is closely related to HTTP/3 protocol despite being handled
by the quic-conn layer. This highlights a flaw in our QUIC architecture
which should be adjusted. To reflect this situation, the function
qc_stop_sending_frm_enqueue() is renamed qc_h3_request_reject(). Also,
internal H3 treatment such as uni-directional bypass has been moved
inside the function.
This commit is only a refactor. However, bug fix on next patches will
rely on it so it should be backported up to 2.6.
This was revealed by Amaury when setting tune.quic.frontend.max-streams-bidi to 8
and asking a client to open 12 streams. haproxy has to send short packets
with small MAX_STREAMS frames encoded on 2 bytes, in addition to a packet number
encoded with only one byte. In this case <len_frms> is the length of the encoded
frames to be added to the packet plus the length of the packet number.
Ensure the length of the packet is at least QUIC_PACKET_PN_MAXLEN by adding a PADDING
frame with (QUIC_PACKET_PN_MAXLEN - <len_frms>) as size. For instance with
a two-byte MAX_STREAMS frame and a one-byte packet number length, this adds
one byte of padding.
See https://datatracker.ietf.org/doc/html/rfc9001#name-header-protection-sample.
Must be backported to 2.7 and 2.6.
When receiving an Initial packet, a peer must drop it if the datagram is smaller
than 1200 bytes. Before this patch, it was the entire datagram which was dropped.
In such a case, drop the packet after having parsed its length.
Must be backported to 2.6 and 2.7
This bug arrives with this commit:
982896961 MINOR: quic: split and rename qc_lstnr_pkt_rcv()
The first block of code consists in possibly setting this variable to true.
But it was already initialized to true before entering this code section.
Should be initialized to false.
Also take the opportunity to remove an unused "err" label.
Must be backported to 2.6 and 2.7.
Before probing the Initial packet number space, verify that we can send at least
1200 bytes per datagram. This may not be the case due to the amplification limit.
Must be backported to 2.6 and 2.7.
This should help in diagnosing issues revealed by the interop runner which counts
the number of handshakes from the number of Initial packets sent by the server.
Must be backported to 2.7.
The aim of this function is to rearm the idle timer. The ->expire
field of the timer task was updated without the task being requeued.
Some connections could be unexpectedly terminated.
Must be backported to 2.6 and 2.7.
This is very helpful during retransmission when receiving ICMP port unreachable
errors after the peer has left. This is the only case at present where
qc_send_hdshk_pkts() or qc_send_app_probing() may fail (when they call
qc_send_ppkts() which fails with ECONNREFUSED as errno).
Also make the callers of qc_dgrams_retransmit() stop their packet processing.
This is the case of quic_conn_app_io_cb() and quic_conn_io_cb().
This modification definitively stops any packet processing when receiving
ICMP port unreachable errors.
Must be backported to 2.7.
The send*() syscalls which are responsible for such ICMP packet reception
fail with ECONNREFUSED as errno.
man(7) udp
ECONNREFUSED
No receiver was associated with the destination address.
This might be caused by a previous packet sent over the socket.
We must kill the underlying connection asap.
Must be backported to 2.7.
This code was there because the timer task was not running on the same thread
as the one which parses the QUIC packets. Now that this is no longer the case,
we can wake up this task directly.
Must be backported to 2.7.
Move quic_rx_pkts_del() out of quic_conn.h to make it benefit from the TRACE API.
Add traces which already helped in diagnosing an issue encountered with
ngtcp2 which sent too many 1-RTT packets before the handshake completion. This
has been fixed here after having discussed with Tasuhiro on QUIC dev slack:
https://github.com/ngtcp2/ngtcp2/pull/663
Must be backported to 2.7.
Some counters could potentially be incremented even if the send*() syscall returned
no error, when ret >= 0 and ret != sz. This could be the case for instance if
a first call to send*() returned -1 with errno set to EINTR (or any previous syscall
set errno to a non-zero value) and if the next call to send*() returned
something positive and smaller than <sz>.
Must be backported to 2.7 and 2.6.
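The faulty pattern and its fix can be sketched with a generic send wrapper
(hypothetical counter name, not the real quic_conn counters): errno is only
meaningful after a syscall actually reported a failure, so a partial send with
a stale ECONNREFUSED/EINTR still sitting in errno must not bump the counters.

  #include <errno.h>
  #include <stddef.h>
  #include <sys/socket.h>
  #include <sys/types.h>

  static long sent_errors; /* hypothetical counter */

  static ssize_t send_count_errors(int fd, const void *buf, size_t sz)
  {
      ssize_t ret = send(fd, buf, sz, MSG_DONTWAIT);

      /* Wrong: "if (ret != (ssize_t)sz && errno == ECONNREFUSED)" may count
       * an error on a mere partial send, because errno can still hold a
       * value set by a previous failed syscall (e.g. an EINTR'd send()).
       */
      if (ret == -1 && errno == ECONNREFUSED)
          sent_errors++;  /* only trust errno when the syscall reported a failure */
      return ret;
  }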
Add traces inside h3_decode_qcs(). Every error path now has its
dedicated trace which should simplify debugging. Each early return has
been converted to a goto invocation.
To complete the demux tracing, demux frame type and length are now
printed using the h3s instance whenever it's possible on trace
invocation. A new internal value H3_FT_UNINIT is used as a frame type to
mark demuxing as inactive.
This should be backported up to 2.7.
Since the recent changes on the clocks, now.tv_sec is not to be used
between processes because it's a clock which is local to the process and
does not contain a real unix timestamp. This patch fixes the issue by
using "data.tv_sec" which is the wall clock instead of "now.tv_sec'.
It prevents having incoherent timestamps.
It also introduces some checks on negatives values in order to never
displays a netative value if it was computed from a wrong value set by a
previous haproxy version.
It must be backported as far as 2.0.
Implement support for clients that emit the stream FIN with an empty
STREAM frame. For that, qcc_recv() offset comparison has been adjusted.
If offset has already been received but the FIN bit is now transmitted,
do not skip the rest of the function and call application layer
decode_qcs() callback.
Without this, streams will be kept open forever as HTX EOM is never
transferred to the upper stream layer.
This behavior was observed with mvfst client prior to its patch
38c955a024aba753be8bf50fdeb45fba3ac23cfd
Fix hq-interop (HTTP 0.9 over QUIC)
This notably caused the interop multiplexing test to fail as unclosed
streams on haproxy side prevented the emission of new MAX_STREAMS frame
to the client.
This should be backported up to 2.6. It also relies on the previous commit:
381d8137e3
MINOR: h3/hq-interop: handle no data in decode_qcs() with FIN set
Properly handle a STREAM frame with no data but the FIN bit set at the
application layer. H3 and hq-interop decode_qcs() callback have been
adjusted to not return early in this case.
If the FIN bit is accepted, a HTX EOM must be inserted for the upper
stream layer. If the FIN is rejected because the stream cannot be
closed, a proper CONNECTION_CLOSE error will be triggered.
A new utility function qcs_http_handle_standalone_fin() has been
implemented in the qmux_http module. This allows to simply add the HTX
EOM on the qcs HTX buffer. If the HTX buffer is empty, an EOT is first added
to ensure it will be transmitted above.
This commit will allow to properly handle FIN notify through an empty
STREAM frame. However, it is not sufficient as currently qcc_recv() skips
the decode_qcs() invocation when the offset is already received. This
will be fixed in the next commit.
This should be backported up to 2.6 along with the next patch.
Several times during debugging it has been difficult to find a way to
reliably indicate if a thread had been started and if it was still
running. It's really not easy because the elements we look at are not
necessarily reliable (e.g. harmless bit or idle bit might not reflect
what we think during a signal). And such notions can be subjective
anyway.
Here we define two thread flags, TH_FL_STARTED which is set as soon as
a thread enters run_thread_poll_loop() and drops the idle bit, and
another one, TH_FL_IN_LOOP, which is set when entering run_poll_loop()
and cleared when leaving it. This should help init/deinit code know
whether it's called from a non-initialized thread (i.e. tid must not
be trusted), or shared functions know if they're being called from a
running thread or from init/deinit code outside of the polling loop.
As reported in github issue #1881, there are situations where an excess
of TLS handshakes can cause a livelock. What's happening is that normally
we process at most one TLS handshake per loop iteration to maintain the
latency low. This is done by tagging them with TASK_HEAVY, queuing these
tasklets in the TL_HEAVY queue. But if something slows down the loop, such
as a connect() call when no more ports are available, we could end up
processing no more than a few hundred or thousand handshakes per second.
If the limit becomes lower than the rate of incoming handshakes, we will
accumulate them and at some point users will get impatient and give up or
retry. Then a new problem happens: the queue fills up with even more
handshake attempts, only one of which will be handled per iteration, so
we can end up processing only outdated handshakes at a low rate, with
basically nothing else in the queue. This can for example happen in
parallel with health checks that don't require incoming handshakes to
succeed to continue to cause some activity that could maintain the high
latency stuff active.
Here we're taking a slightly different approach. First, instead of always
allowing only one handshake per loop (and usually it's critical for
latency), we take the current situation into account:
- if configured with tune.sched.low-latency, the limit remains 1
- if there are other non-heavy tasks, we set the limit to 1 + one
per 1024 tasks, so that a heavily loaded queue of 4k handshakes
per thread will be able to drain them at ~4 per loops with a
limited impact on latency
- if there are no other tasks, the limit grows to 1 + one per 128
tasks, so that a heavily loaded queue of 4k handshakes per thread
will be able to drain them at ~32 per loop with still a very
limited impact on latency since only I/O will get delayed.
It was verified on a 56-core Xeon-8480 that this did not degrade the
latency; all requests remained below 1ms end-to-end in full close+
handshake, and even 500us under low-lat + busy-polling.
This must be backported to 2.4.
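The resulting budget can be summarized by the following sketch (illustrative
function and parameter names; the low-latency special case and the 1024/128
ratios come directly from the description above):

  /* Compute how many TASK_HEAVY tasklets (e.g. TLS handshakes) may run in
   * one polling loop iteration, according to the policy described above.
   */
  static unsigned int heavy_budget(int low_latency, int other_classes_present,
                                   unsigned int queued_tasks)
  {
      if (low_latency)
          return 1;                          /* strict latency mode: one per loop */
      if (other_classes_present)
          return 1 + queued_tasks / 1024;    /* ~4-5 per loop with 4k handshakes queued */
      return 1 + queued_tasks / 128;         /* nothing else to run: ~32 per loop */
  }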
There's a per-thread "long_rq" counter that is used to indicate how
often we leave the scheduler with tasks still present in the run queue.
The purpose is to know when tune.runqueue-depth served to limit latency,
due to a large number of tasks being runnable at once.
However there's a bug there, it's not always set: if after the first
run, one heavy task was processed and later only heavy tasks remain,
we'll loop back to not_done_yet where we try to pick more tasks, but
none are eligible (since heavy ones have already run) so we directly
return without incrementing the counter. This is what causes ultra-low
values on long_rq during massive SSL handshakes, which are confusing
because they make one believe that tl_class_mask doesn't have the HEAVY
flag anymore. Let's just fix that by not returning from the middle of
the function.
This can be backported as far as 2.4.
In 2.7, the method used to check for a sleeping thread changed with
commit e7475c8e7 ("MEDIUM: tasks/fd: replace sleeping_thread_mask with
a TH_FL_SLEEPING flag"). Previously there was a global sleeping mask
and now there is a flag per thread. The commit above partially broke
the watchdog by looking at the current thread's flags via th_ctx
instead of the reported thread's flags, and using an AND condition
instead of an OR to update and leave. This can cause a wrong thread
to be killed when the load is uneven. For example, when enabling
busy polling and sending traffic over a single connection, all
threads have their run time grow, and if the one receiving the
signal is also processing some traffic, it will not match the
sleeping/harmless condition and will set the stuck flag, then die
upon next invocation. While it's reproducible in tests, it's unlikely
to be met in field.
This fix should be backported to 2.7.
The -dF option can now be used to disable data fast-forward. It does the
same as the global option "tune.fast-forward off". Some reg-tests may rely
on this optimization. To detect the feature and skip such scripts, the
following vtest command must be used:
feature cmd "$HAPROXY_PROGRAM -cc '!(globa.tune & GTUNE_NO_FAST_FWD)'"
The new global option "tune.fast-forward" can be set to "off" to disable the
data fast-forward. It is a debug option, thus it is internally marked as
experimental. The directive "expose-experimental-directives" must be set
first to use this one. By default, the data fast-forward is enabled.
It can be useful to force the stream to be woken up when data are received,
to make sure everything works fine in this case. The data fast-forward is an
optimization and everything must work without it. But some code may rely on
the fact that the stream will not be woken up. With this option, it is
possible to spot some hidden bugs.
At the stream level, the read expiration date is unset if a shutr was
received but not if the end of input was reached. If we know no more data
are expected, there is no reason to leave the read expiration date armed,
except to respect the clientfin/serverfin timeouts in some circumstances.
This patch could slowly be backported as far as 2.2.
During the payload forwarding, since the commit f2b02cfd9 ("MAJOR: http-ana:
Review error handling during HTTP payload forwarding"), when an error
occurred on one side, we don't rely anymore on a specific HTTP message state
to detect it on the other side. However, nothing was added to detect the
error. Thus, when this happens, a spinning loop may be experienced and an
abort because of the watchdog.
To fix the bug, we must detect the opposite side is closed by checking the
opposite SC state. Concretely, in http_end_request() and http_end_response(),
we wait for the other side only if the HTTP message state is lower than
HTTP_MSG_DONE (the message is not finished) and the SC state is not
SC_ST_CLO (the opposite side is not closed). In these functions, we don't
care if there was an error on the opposite side. We only take care to detect
when we must stop waiting for the other side.
This patch should fix the issue #2042. No backport needed.
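As a simplified illustration of that condition (the enum values below are
placeholders standing in for HAProxy's real definitions):

  enum { HTTP_MSG_DONE = 28 };  /* placeholder: "message finished" threshold */
  enum { SC_ST_CLO = 9 };       /* placeholder: "stream-connector closed" state */

  /* keep waiting for the opposite side only while its message is not
   * finished AND its stream-connector is not closed
   */
  static int must_wait_for_other_side(int other_msg_state, int other_sc_state)
  {
      return other_msg_state < HTTP_MSG_DONE && other_sc_state != SC_ST_CLO;
  }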
The ssl_bind_kw structure is exclusively used for crt-list keywords; it
must be renamed to avoid confusion.
The structure was renamed ssl_crtlist_kws.
The HTTP header parsers surprisingly accept empty header field names,
and this is a leftover from the original code that was agnostic to this.
When muxes were introduced, for H2 first, the HPACK decompressor needed
to feed headers lists, and since empty header names were strictly
forbidden by the protocol, the lists of headers were purposely designed
to be terminated by an empty header field name (a principle that is
similar to H1's empty line termination). This principle was preserved
and generalized to other protocols migrated to muxes (H1/FCGI/H3 etc)
without anyone ever noticing that the H1 parser was still able to deliver
empty header field names to this list. In addition to this it turns out
that the HPACK decompressor, despite a comment in the code, may
successfully decompress an empty header field name, and this mistake
was propagated to the QPACK decompressor as well.
The impact is that an empty header field name may be used to truncate
the list of headers and thus make some headers disappear. While for
H2/H3 the impact is limited as haproxy sees a request with missing
headers, and headers are not used to delimit messages, in the case of
HTTP/1, the impact is significant because the presence (and sometimes
contents) of certain sensitive headers is detected during the parsing.
Thus, some of these headers may be seen, marked as present, their value
extracted, but never delivered to upper layers and obviously not
forwarded to the other side either. This can have for consequence that
certain important header fields such as Connection, Upgrade, Host,
Content-length, Transfer-Encoding etc are possibly seen as different
between what haproxy uses to parse/forward/route and what is observed
in http-request rules and of course, forwarded. One direct consequence
is that it is possible to exploit this property in HTTP/1 to make
affected versions of haproxy forward more data than is advertised on
the other side, and bypass some access controls or routing rules by
crafting extraneous requests. Note, however, that responses to such
requests will normally not be passed back to the client, but this can
still cause some harm.
This specific risk can be mostly worked around in configuration using
the following rule that will rely on the bug's impact to precisely
detect the inconsistency between the known body size and the one
expected to be advertised to the server (the rule works from 2.0 to
2.8-dev):
http-request deny if { fc_http_major 1 } !{ req.body_size 0 } !{ req.hdr(content-length) -m found } !{ req.hdr(transfer-encoding) -m found } !{ method CONNECT }
This will exclusively block such carefully crafted requests delivered
over HTTP/1. HTTP/2 and HTTP/3 do not need content-length, and a body
that arrives without being announced with a content-length will be
forwarded using transfer-encoding, hence will not cause discrepancies.
In HAProxy 2.0 in legacy mode ("no option http-use-htx"), this rule will
simply have no effect but will not cause trouble either.
A clean solution would consist in modifying the loops iterating over
these headers lists to check the header name's pointer instead of its
length (since both are zero at the end of the list), but this requires
to touch tens of places and it's very easy to miss one. Functions such
as htx_add_header(), htx_add_trailer(), htx_add_all_headers() would be
good starting points for such a possible future change.
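To illustrate the difference (with deliberately simplified types, not the
real htx/ist ones):

  #include <stddef.h>

  /* The list terminator has both a NULL name pointer and a zero length,
   * while an injected empty header only has a zero length. Checking the
   * pointer keeps scanning past such an empty name.
   */
  struct hdr {
      const char *name;    /* NULL only on the terminating element */
      size_t      name_len;
  };

  static size_t count_headers_by_len(const struct hdr *list)
  {
      size_t i = 0;
      while (list[i].name_len != 0)  /* fragile: stops on injected empty names */
          i++;
      return i;
  }

  static size_t count_headers_by_ptr(const struct hdr *list)
  {
      size_t i = 0;
      while (list[i].name != NULL)   /* robust: only the real terminator matches */
          i++;
      return i;
  }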
Instead the current fix focuses on blocking empty headers where they
are first inserted, hence in the H1/HPACK/QPACK decoders. One benefit
of the current solution (for H1) is that it allows "show errors" to
report a precise diagnostic when facing such invalid HTTP/1 requests,
with the exact location of the problem and the originating address:
$ printf "GET / HTTP/1.1\r\nHost: localhost\r\n:empty header\r\n\r\n" | nc 0 8001
HTTP/1.1 400 Bad request
Content-length: 90
Cache-Control: no-cache
Connection: close
Content-Type: text/html
<html><body><h1>400 Bad request</h1>
Your browser sent an invalid request.
</body></html>
$ socat /var/run/haproxy.stat <<< "show errors"
Total events captured on [10/Feb/2023:16:29:37.530] : 1
[10/Feb/2023:16:29:34.155] frontend decrypt (#2): invalid request
backend <NONE> (#-1), server <NONE> (#-1), event #0, src 127.0.0.1:31092
buffer starts at 0 (including 0 out), 16334 free,
len 50, wraps at 16336, error at position 33
H1 connection flags 0x00000000, H1 stream flags 0x00000810
H1 msg state MSG_HDR_NAME(17), H1 msg flags 0x00001410
H1 chunk len 0 bytes, H1 body len 0 bytes :
00000 GET / HTTP/1.1\r\n
00016 Host: localhost\r\n
00033 :empty header\r\n
00048 \r\n
I want to address sincere and warm thanks for their great work to the
team composed of the following security researchers who found the issue
together and reported it: Bahruz Jabiyev, Anthony Gavazzi, and Engin
Kirda from Northeastern University, Kaan Onarlioglu from Akamai
Technologies, Adi Peleg and Harvey Tuch from Google. And kudos to Amaury
Denoyelle from HAProxy Technologies for spotting that the HPACK and
QPACK decoders would let this pass despite the comment explicitly
saying otherwise.
This fix must be backported as far as 2.0. The QPACK changes can be
dropped before 2.6. In 2.0 there is also the equivalent code for legacy
mode, which doesn't suffer from the list truncation, but it would better
be fixed regardless.
CVE-2023-25725 was assigned to this issue.
There was a parenthesis placed in the wrong place for a memcmp().
As a consequence, clients could not reuse a UDP address for a new connection.
Must be backported to 2.7.
The commit d5983cef8 ("MINOR: listener: remove the useless ->default_target
field") revealed a bug in the SPOE. No default-target must be defined for
the SPOE agent frontend. SPOE applets are used on the frontend side and a
TCP connection is established on the backend side.
Because of this bug, since the commit above, the stream target is set to the
SPOE applet instead of the backend connection, leading to a spinning loop on
the applet when it is released because we are unable to close the backend side.
This patch should fix the issue #2040. It only affects the 2.8-DEV but to
avoid any future bug, it should be backported to all stable versions.
When a client timeout is reported by the H1 mux, it is not an error but an
abort. Thus, H1C_F_ERROR flag must not be set. It is especially important to
not inhibit the send. Because of this bug, a 408-Request-time-out is
reported in logs but the error message is not sent to the client.
This patch must be backported to 2.7.
We should not report LAST data in log if the response is in TUNNEL mode on
client close/timeout because there is no way to be sure it is the last
data. This means it can only be reported in DONE, CLOSING or CLOSE states.
No backport needed.
There is already a test on CF_EOI and CF_SHUTR. The last one is always set
when a read error is reported. Thus there is no reason to check
CF_READ_ERROR.
This change was performed on all applet I/O handlers but one was missed. In
the CLI I/O handler used to commit a CA/CRL file, we can remove the test on
CF_WRITE_ERROR because there is already a test on CF_SHUTW.
This has been detected by libasan as follows:
=================================================================
==3170559==ERROR: AddressSanitizer: global-buffer-overflow on address 0x55cf77faad08 at pc 0x55cf77a87370 bp 0x7ffc01bdba70 sp 0x7ffc01bdba68
READ of size 8 at 0x55cf77faad08 thread T0
#0 0x55cf77a8736f in cli_find_kw src/cli.c:335
#1 0x55cf77a8a9bb in cli_parse_request src/cli.c:792
#2 0x55cf77a8c385 in cli_io_handler src/cli.c:1024
#3 0x55cf77d19ca1 in task_run_applet src/applet.c:245
#4 0x55cf77c0b6ba in run_tasks_from_lists src/task.c:634
#5 0x55cf77c0cf16 in process_runnable_tasks src/task.c:861
#6 0x55cf77b48425 in run_poll_loop src/haproxy.c:2934
#7 0x55cf77b491cf in run_thread_poll_loop src/haproxy.c:3127
#8 0x55cf77b4bef2 in main src/haproxy.c:3783
#9 0x7fb8b0693d09 in __libc_start_main ../csu/libc-start.c:308
#10 0x55cf7764f4c9 in _start (/home/flecaille/src/haproxy-untouched/haproxy+0x1914c9)
0x55cf77faad08 is located 0 bytes to the right of global variable 'cli_kws' defined in 'src/quic_conn.c:7834:27' (0x55cf77faaca0) of size 104
SUMMARY: AddressSanitizer: global-buffer-overflow src/cli.c:335 in cli_find_kw
Shadow bytes around the buggy address:
According to cli_find_kw() code and cli_kw_list struct definition, the second
member of this structure ->kw[] must be a null-terminated array.
Add a last element with default initializers to <cli_kws> global variable which
is impacted by this bug.
This bug arrived with this commit:
15c74702d MINOR: quic: implement a basic "show quic" CLI handler
Must be backported to 2.7 where this previous commit has been already
backported.
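The pattern involved looks roughly like this (heavily simplified compared to
the real cli_kw/cli_kw_list types):

  #include <stddef.h>
  #include <string.h>

  struct cli_kw {
      const char *kw;
      const char *usage;
  };

  /* the lookup walks the array until it meets an all-zero entry, so the
   * table must be explicitly terminated, which is what the fix adds
   */
  static const struct cli_kw cli_kws[] = {
      { "show quic", "show quic : display QUIC connection info" },
      { NULL, NULL }   /* terminating element: stops the scan */
  };

  static const struct cli_kw *find_kw(const char *name)
  {
      const struct cli_kw *k;

      for (k = cli_kws; k->kw != NULL; k++) {  /* relies on the terminator */
          if (strcmp(k->kw, name) == 0)
              return k;
      }
      return NULL;
  }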
It is not really a bug fix because it does not fix any known issue. And it is
flagged as MEDIUM because it is sensitive. But if there are some extra calls
to process_stream(), it can be an issue because, in si_update_rx(), we may
disable reading for the SC when outgoing data are blocked in the input
channel. But it is not really the process_stream() job to take care of
that. This may block data receipt.
It is an old code, mainly here to avoid wakeup in loop on the stats
applet. Today, it seems useless and can lead to bugs. An endpoint is
responsible to block the SC if it waits for some room and the opposite
endpoint is responsible to unblock it when some data are sent. The stream
should not interfere on this part.
This patch could be backported to 2.7 after a period of observation. And it
should only be backported to lower versions if an issue is reported.
For an unknown reason, the change of uptime calculation for the HTML
page didn't make it into commit 6093ba47c ("BUG/MINOR: clock: do not mix
wall-clock and monotonic time in uptime calculation"). Let's address it
as well otherwise the stats page will display an incorrect uptime.
No backport needed unless the patch above is backported.
Uptime calculation for master process was incorrect as it used
<start_date> as its timestamp base time. Fix this by using the scheduler
time <start_time> for this.
The impact of this bug is minor as timestamp base time is only used for
"show proc" CLI output. it was highlighted by the following commit.
which caused a negative value to be displayed for the master process
uptime on "show proc" output.
28360dc53f
MEDIUM: clock: force internal time to wrap early after boot
This should be backported up to 2.0.
Incorrect printf format specifier "%lu" was used on "show quic" handler
for uint64_t. This breaks build on 32-bits architecture. To fix this
portability issue, force an explicit cast to unsigned long long with
"%llu" specifier.
This must be backported up to 2.7.
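A short sketch of the portability rule involved (illustrative code, not the
actual handler): on 32-bit targets uint64_t is usually "unsigned long long",
not "unsigned long", so either cast explicitly for "%llu" or use PRIu64.

  #include <inttypes.h>
  #include <stdio.h>

  static void print_counter(uint64_t pkt_count)
  {
      printf("pkts=%llu\n", (unsigned long long)pkt_count);  /* explicit cast */
      printf("pkts=%" PRIu64 "\n", pkt_count);               /* alternative */
  }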
With a connection, when data are received, if these data are sent to the
opposite side because the fast-forwarding is possible, the stream may be
woken up on some conditions (at the end of sc_app_chk_snd_conn()):
* The channel is shut for write
* The SC is not in the "established" state
* The stream must explicitly be woken up on write and all data was sent
* The connection was just established.
A bug on the last condition was introduced with the commit d89884153
("MEDIUM: channel: Use CF_WRITE_EVENT instead of CF_WRITE_PARTIAL"). The
stream is now woken up on any write events.
This patch fixes this issue and restores the original behavior. No backport
is needed.
Filtering of closing/draining connections on "show quic" was not
properly implemented. This caused the extra argument "all", meant to
display all connections, to have no effect. This patch fixes this and
restores the output of all connections.
This must be backported up to 2.7.
Reduce default "show quic" output by masking connections in
closing/draining state due to a CONNECTION_CLOSE emission/reception. These
connections can still be displayed using the special argument "all".
This should be backported up to 2.7.
Complete "show quic" handler by displaying information about
quic_stream_desc entries. These structures are used to emit stream data
and store them until acknowledgment is received.
This should be backported up to 2.7.
Complete "show quic" handler by displaying various information related
to each encryption level and packet number space. Most notably, ack
ranges and bytes in flight are present to help debug retransmission
issues.
This should be backported up to 2.7.
Complete "show quic" handler by displaying information related to the
quic_conn owned socket. First, the FD is printed, followed by the
address of the local and remote endpoint.
This should be backported up to 2.7.
Complete "show quic" handler. Source and destination CIDs are printed
for every connection. This is completed by state info reflecting whether
the handshake is complete, whether a CONNECTION_CLOSE has been emitted or
received, and the allocation status of the attached MUX. Finally the idle
timer expiration is also printed.
This should be backported up to 2.7.
Implement a basic "show quic" CLI handler. This command will be useful
to display various information on all the active QUIC frontend
connections.
This work is heavily inspired by "show sess". Most notably, a global
list of quic_conn has been introduced to be able to loop over them. This
list is stored per thread in ha_thread_ctx.
Also add three CLI handlers for "show quic" in order to allocate and
free the command context. The dump handler runs on thread isolation.
Each quic_conn is referenced using a back-ref to handle deletion during
handler yielding.
For the moment, only a list of raw quic_conn pointers is displayed. The
handler will be completed over time with more information as needed.
This should be backported up to 2.7.
Commit 0aba11e9e ("MINOR: quic: remove unnecessary quic_session_accept()")
overlooked one problem, in session_accept_fd() at the end, there's a bunch
of FD-specific stuff that either sets up or resets the socket at the TCP
level. The tests are mostly performed for AF_INET/AF_INET6 families but
they're only for one part (i.e. to avoid setting up TCP options on UNIX
sockets). Other pieces continue to configure the socket regardless of its
family. All of this directly acts on the FD, which is not correct since
the FD is not valid here, it corresponds to the QUIC handle. The issue
is much more visible when "option nolinger" is enabled in the frontend,
because the access to fdtab[cfd].state immediately crashes on the first
connection, as can be seen in github issue #2030.
This patch bypasses this setup for FD-less connections, such as QUIC.
However some of them could definitely be relevant to the QUIC stack, or
even to UNIX sockets sometimes. A better long-term solution would consist
in implementing a setsockopt() equivalent at the protocol layer that would
be used to configure the socket, either the FD or the QUIC conn depending
on the case. Some of them would not always be implemented but that would
allow to unify all this code.
This fix must be backported everywhere the commit above is backported,
namely 2.6 and 2.7.
Thanks to github user @twomoses for the nicely detailed report.
The commit 7f59d68fe ("BUG/MEDIUM: stconn: Flush output data before
forwarding close to write side") introduced a regression. When the read side
is closed, the close is not forwarded to the write side if there are some
pending outgoing data. The idea is to forward data first and then close the
write side. However, when fast-forwarding is enabled and the last data block
is received with the read0, the close is never forwarded.
We cannot revert the commit above because it really fixes an issue. However,
we can schedule the shutdown for write by setting CF_SHUTW_NOW flag on the
write side. Indeed, it is the purpose of this flag.
To not replicate ugly and hardly maintainable code block at different places
in stconn.c, a helper function is used. Thus, sc_cond_forward_shutw() must
be called to know if the close can be forwarded or not. It returns 1 if it is
possible. In this case, the caller is responsible to forward the close to
the write side. Otherwise, if the close cannot be forwarded, 0 is
returned. It happens when it should not be performed at all. Or when it
should only be delayed, waiting for the input channel to be flushed. In this
last case, the CF_SHUTW_NOW flag is set in the output channel.
This patch should fix the issue #2033. It must be backported with the commit
above, thus at least as far as 2.2.
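A hedged sketch of the helper's contract (simplified flags and arguments,
not the exact HAProxy code): return 1 when the close may be forwarded now,
otherwise return 0 and, if the close must only be delayed until pending
output is flushed, schedule it by setting CF_SHUTW_NOW on the output side.

  #define CF_SHUTW_NOW 0x1000u   /* placeholder value for the illustration */

  static int cond_forward_shutw(int shut_received, int output_pending,
                                unsigned int *out_flags)
  {
      if (!shut_received)
          return 0;                       /* nothing to forward at all */
      if (output_pending) {
          *out_flags |= CF_SHUTW_NOW;     /* delay: flush output data first */
          return 0;
      }
      return 1;   /* caller may forward the close to the write side now */
  }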
When a new server was added through the cli using "server add" command,
the maxconn/minconn consistency check historically implemented in
check_config_validity() for static servers was missing.
As a result, when adding a server with the maxconn parameter without the
minconn set, the server was unable to handle any connection because
srv_dynamic_maxconn() would always return 0.
Consider the following reproducer:
| global
| stats socket /tmp/ha.sock mode 660 level admin expose-fd listeners
|
| defaults
| timeout client 5s
| timeout server 5s
| timeout connect 5s
|
| frontend test
| mode http
| bind *:8081
| use_backend farm
|
| listen dummyok
| bind localhost:18999
| mode http
| http-request return status 200 hdr test "ok"
|
| backend farm
| mode http
Start haproxy and perform the following :
echo "add server farm/t1 127.0.0.1:18999 maxconn 100" | nc -U /tmp/ha.sock
echo "enable server farm/t1" | nc -U /tmp/ha.sock
curl localhost:8081 # -> 503 after 5s connect timeout
Thanks to ("MINOR: cfgparse/server: move (min/max)conn postparsing logic into
dedicated function"), we are now able to perform the consistency check after
the new dynamic server has been parsed.
This is enough to fix the issue documented here that was reported by
Thomas Pedoussaut on the ML.
This commit depends on:
- ("MINOR: cfgparse/server: move (min/max)conn postparsing logic into
dedicated function")
It must be backported to 2.6 and 2.7
In check_config_validity() function, we performed some consistency checks to
adjust minconn/maxconn attributes for each declared server.
We move this logic into a dedicated function named srv_minmax_conn_apply()
to be able to perform those checks later in the process life when needed
(ie: dynamic servers)
Deduplicate the code which checks the OCSP update in the ckch_store and
in the crtlist_entry.
Also, jump immediately to error handling when ERR_FATAL is caught.
GH issue #2034 clearly indicates yet another case of time roll-over
that went badly. Issues that happen only once every 50 days are hard
to detect and debug, and are usually reported more or less synchronized
from multiple sources. This patch finally does what had long been planned
but never done yet, which is to force the time to wrap early after boot
so that any such remaining issue can be spotted quicker. The margin delay
here is 20s (it may be changed by setting BOOT_TIME_WRAP_SEC to another
value). This value seems sufficient to permit failed health checks to
succeed and traffic to come in and possibly start to update some time
stamps (accept dates in logs, freq counters, stick-tables expiration
dates etc).
It could theoretically be helpful to have this in 2.7, but as can be
seen with the two patches below, we've already had incorrect use cases
of the internal monotonic time when the wall-clock one was needed, so
we could expect to detect other ones in the future. Note that this will
*not* induce bugs, it will only make them happen much faster (i.e. no
need to wait for 50 days before seeing them). If it were to eventually
be backported, these two previous patches must also be backported:
BUG/MINOR: clock: use distinct wall-clock and monotonic start dates
BUG/MEDIUM: cache: use the correct time reference when comparing dates
The cache makes use of dates advertised by external components, such
as "last-modified" or "date". As such these are wall-clock dates, and
not internal dates. However, all comparisons are mistakenly made based
on the internal monotonic date which is designed to drift from the wall
clock one in order to catch up with stolen time (which can sometimes be
intense in VMs). As such after some run time some objects may fail to
validate or fail to expire depending on the direction of the drift. This
is particularly visible when applying an offset to the internal time to
force it to wrap soon after startup, as it will be shifted up to 49.7
days in the future depending on the current date; what happens in this
case is that the reg-test "cache_expires.vtc" fails on the 3rd test by
returning stale contents from the cache at the date of this commit.
It is really important that all external dates are compared against
"date" and not "now" for this reason.
This fix needs to be backported to all versions.
We've had a start date even before the internal monotonic clock existed,
but once the monotonic clock was added, the start date was not updated
to distinguish the wall clock time units and the internal monotonic time
units. The distinction is important because both clocks do not necessarily
progress at the same speed. The very rare occurrences of the wall-clock
date are essentially for human consumption and communication with third
parties (e.g. report the start date in "show info" for monitoring
purposes). However currently this one is also used to measure the distance
to "now" as being the process' uptime. This is actually not correct. It
only works because for now the two dates are initialized at the exact
same instant at boot but could still be wrong if the system's date shows
a big jump backwards during startup for example. In addition the current
situation prevents us from enforcing an arbitrary offset at boot to reveal
some heisenbugs.
This patch adds a new "start_time" at boot that is set from "now" and is
used in uptime calculations. "start_date" instead is now set from "date"
and will always reflect the system date for human consumption (e.g. in
"show info"). This way we're now sure that any drift of the internal
clock relative to the system date will not impact the reported uptime.
This could possibly be backported though it's unlikely that anyone has
ever noticed the problem.
At some moments expired stick table records stop being removed. This
happens when the internal time wraps around the 32-bit limit, or every
49.7 days. What precisely happens is that some elements that are collected
close to the end of the time window (2^32 - table's "expire" setting)
might have been updated and will be requeued further, at the beginning
of the next window. Here, three bad situations happen:
- the incorrect integer-based comparison that is not aware of wrapping
will result in the scan to restart from the freshly requeued element,
skipping all those at the end of the window. The net effect of this
is that at each wakeup of the expiration task, only one element from
the end of the window will be expired, and other ones will remain
there for a very long time, especially if they have to wait for all
the predecessors to be picked one at a time after slow wakeups due
to a long expiration ; this is what was observed in issue #2034
making the table fill up and appear as not expiring at all, and it
seems that issue #2024 reports the same problem at the same moment
(since such issues happen for everyone roughly at the same time
when the clock doesn't drift too much).
- the elements that were placed at the beginning of the next window
are skipped as well for as long as there are refreshed entries at
the end of the previous window, so these ones participate to filling
the table as well. This is caused by the restart from the current,
updated node that is generally placed after most other less recently
updated elements.
- once the last element at the end of the window is picked, suddenly
there is a large amount of expired entries at the beginning of the
next window that all have to be requeued. If the expiration delay
is large, the number can be big and it can take a long time, which
can very likely explain the periodic crashes reported in issue #2025.
Limiting the batch size as done in commit dfe79251d ("BUG/MEDIUM:
stick-table: limit the time spent purging old entries") would make
sense for process_table_expire() as well.
This patch addresses the incorrect tree scan algorithm to make sure that:
- there's always a next element to compare against, even when dealing
with the last one in the tree, the first one must be used ;
- time comparisons used to decide whether to restart from the current
element use tick_is_lt() as it is the only case where we know the
current element will be placed before any other one (since the tree
respects insertion ordering for duplicates)
In order to reproduce the issue, it was found that injecting traffic on
a random key that spans over half of the size of a table whose expiration
is set to 15s while the date is going to wrap in 20s does exhibit an
increase of the table's size 5s after startup, when entries start to be
pushed to the next window. It's more effective when a second load
generator constantly hammers a same key to be certain that none of them
is ready to expire. This doesn't happen anymore after this patch.
This fix needs to be backported to all stable versions. The bug has been
there for as long as the stick tables were introduced in 1.4-dev7 with
commit 3bd697e07 ("[MEDIUM] Add stick table (persistence) management
functions and types"). A cleanup could consists in deduplicating that
code by having process_table_expire() call __stktable_trash_oldest(),
with that one improved to support an optional time check.
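For reference, a hedged sketch of a wrap-aware comparison in the spirit of
tick_is_lt(): comparing through a signed difference remains correct across
the 32-bit wrap, unlike a plain "a < b" on the raw values.

  #include <stdint.h>

  static inline int tick_before(uint32_t a, uint32_t b)
  {
      return (int32_t)(a - b) < 0;   /* true if a is "before" b, wrap-safe */
  }

  /* Example: with a = 0xFFFFFFF0 (just before the wrap) and b = 0x00000010
   * (just after it), tick_before(a, b) is true while "a < b" is false.
   */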
Display a warning when some text exists between the filename and the
options. This part is completely ignored so if there are filters here,
they were never parsed.
This could be backported to all versions. In the older versions, the
parsing was done in ssl_sock_load_cert_list_file() in ssl_sock.c.
Aurélien reported that the BUG_ON(!new_ts.nbgrp) added in 2.8-dev3 by
commit 50440457e ("MEDIUM: config: restrict shards, not bind_conf to one
group each") can trigger on some invalid configs where the thread_set on
the "bind" line couldn't be resolved. The reason is that we still enter
the parsing loop (as it was done previously) and we possibly have no
group to work on (which was the purpose of this assertion). There we
need to bypass all this block on such a condition.
No backport is needed.
Aurélien reported a bug making a statement such as "thread 2-2" fail for
a config made of exactly 2 threads. What happens is that the parser for
the "thread" keyword scans a range of thread numbers from either 1..64
or 0,-1,-2 for special values, and presets the bit masks accordingly in
the thread set, except that due to the 1..64 range, the shift length must
be reduced by one. Not doing this causes empty masks for single-bit values
that are exactly equal to the number of threads in the group and fails to
properly parse.
No backport is needed as this was introduced in 2.8-dev3 by commit
bef43dfa6 ("MINOR: thread: add a simple thread_set API").
Due to multithreading concurrency, it is difficult at this time to figure
out how this counter may become negative. This simple patch only ensures this
will never be the case.
This issue arrives with this commit:
"9969adbcdc MINOR: stats: add by HTTP version cumulated number of sessions and requests"
So, this patch should be backported when the latter has been backported.
When stats_putchk() fails to perform the dump because available data space in
htx is less than the number of bytes pending in the dump buffer, we wait
for more room in the htx (ie: sc_need_room()) to retry the dump attempt
on the next applet invocation.
To provide consistent output, we have to make sure that the stat ctx is not
updated (or at least correctly reverted) in case stats_putchk() fails so
that the new dumping attempt behaves just like the previous (failed) one.
STAT_STARTED is not following this logic, the flag is set in
stats_dump_fields_json() as soon as some data is written to the output buffer.
It's done too early: we need to delay this step after the stats_putchk() has
successfully returned if we want to correctly handle the retries attempts.
Because of this, JSON output could suffer from extraneous ',' characters which
could make json parsers unhappy.
For example, this is the kind of errors you could get when using
`python -m json.tool` on such badly formatted outputs:
"Expecting value: line 1 column 2 (char 1)"
Unfortunately, fixing this means that the flag needs to be enabled at
multiple places, which is what we're doing in this patch.
(in stats_dump_proxy_to_buffer() where stats_dump_one_line() is involved
by underlying stats_dump_{fe,li,sv,be} functions)
Thereby, this raises the need for a cleanup to reduce code duplication around
stats_dump_proxy_to_buffer() function and simplify things a bit.
It could be backported to 2.6 and 2.7
In ("MINOR: stats: introduce stats field ctx"), we forgot
to apply the patch to servers.
This prevents "BUG/MINOR: stats: fix show stat json buffer limitation"
from working with servers dump.
We're adding the missing part related to servers dump.
This commit should be backported with the aforementioned commits.
When ctx->field was introduced with ("MINOR: stats: introduce stats field ctx")
a mistake was made for the STAT_PX_ST_LI state in stats_dump_proxy_to_buffer():
current_field reset is placed after the for loop, ie: after multiple lines
are dumped. Instead it should be placed right after each li line is dumped.
This could cause some output inconsistencies (missing fields), especially when
http dump is used with JSON output and "socket-stats" option is enabled
on the proxy, because when htx is full we restore the ctx->field with
current_field (which contains outdated value in this case).
This should be backported with ("MINOR: stats: introduce stats field ctx")
In ("BUG/MEDIUM: stats: Rely on a local trash buffer to dump the stats"),
we forgot to apply the patch in resolvers.c which provides the
stats_dump_resolvers() function that is involved when dumping with "resolvers"
domain.
As a consequence, resolvers dump was broken because stats_dump_one_line(),
which is used in stats_dump_resolv_to_buffer(), implicitly uses trash_chunk
from stats.c to prepare the dump, and stats_putchk() is then called with
global trash (currently empty) as output data.
Given that trash_dump variable is static and thus only available within stats.c
we change stats_putchk() function prototype so that the function does not take
the output buffer as an argument. Instead, stats_putchk() will implicitly use
the local trash_dump variable declared in stats.c.
It will also prevent further mixups between stats_dump_* functions and
stats_putchk().
This needs to be backported with ("BUG/MEDIUM: stats: Rely on a local trash
buffer to dump the stats")
In ("BUG/MINOR: stats: use proper buffer size for http dump"),
we used trash.size as source buffer size before applying the htx
overhead computation.
It is safer to use res->buf.size instead since res_htx (which is <htx> argument
passed to stats_putchk() in http context) is made from res->buf:
in http_stats_io_handler:
| res_htx = htx_from_buf(&res->buf);
This will prevent the hang bug from showing up again if res->buf.size were to be
less than trash.size (which is set according to tune.bufsize).
This should be backported with ("BUG/MINOR: stats: use proper buffer size for http dump")
When building STREAM frames in a packet buffer, if a frame is too large
it will be split in two. A shortened version will be used and the
original frame will be modified to represent the remaining space.
To ensure there is enough space to store the frame data length encoded
as a QUIC integer, we use the function max_available_room(). This
function can return 0 if there is only a small amount of space left,
which is insufficient for the frame header and the shortened data. Prior
to this patch, this wasn't checked and an unneeded empty STREAM frame was
built and sent for nothing.
Change this by checking the value returned by max_available_room(). If 0,
do not try to split this frame and continue to the next ones in the
packet.
On 2.6, this patch serves as an optimization which will prevent the building
of unneeded empty STREAM frames.
On 2.7, this behavior has the side-effect of triggering a BUG_ON()
statement on quic_build_stream_frame(). This BUG_ON() ensures that we do
not use a quic_frame with the OFF bit set if its offset is 0. This can happen
if the condition defined above is reproduced for a STREAM frame at
offset 0. An empty unneeded frame is built as described. The problem is
that the original frame is modified with its OFF bit set even if the
offset is still 0.
This must be backported up to 2.6.
Now that we're using thread_sets there's no need to restrict an entire
bind_conf to 1 group, the real concern being the FD, we can move that
restriction to the shard only. This means that as long as we have enough
shards and that they're properly aligned on group boundaries (i.e. shards
are an integer divider of the number of threads), we can support "bind"
lines spanning more than one group.
The check is still performed for shards to span more than one group,
and an error is emitted when this happens. But at least now it becomes
possible to have this:
global
nbthread 256
frontend foo
bind :1111 shards 4
bind :2222 shards by-thread
Let's now retrieve the first thread group and its mask from the
thread_set so that we don't need these fields in the bind_conf anymore.
For now we're still limited to the first group (like before) but that
allows to get rid of these fields and to make sure that there's nothing
"special" being done there anymore.
Instead of reading and storing a single group and a single mask for a
"thread" directive on a bind line, we now store the complete range in
a thread set that's stored in the bind_conf. The bind_parse_thread()
function now just calls parse_thread_set() to complete the current set,
which starts empty, and thread_resolve_group_mask() was updated to
support retrieving thread group numbers or absolute thread numbers
directly from the pre-filled thread_set, and continue to feed bind_tgroup
and bind_thread. The CLI parsers which were pre-initialized to set the
bind_tgroup to 1 cannot do it anymore as it would prevent one from
restricting the thread set. Instead check_config_validity() now detects
the CLI frontend and passes the info down to thread_resolve_group_mask()
that will automatically use only the group 1's threads for these
listeners. The same is done for the peers listeners for now.
At this step it's already possible to start with all previous valid
configs as well as extended ones supporting comma-delimited thread
sets. In addition the parser already accepts large ranges spanning
multiple groups, but since the underlying listeners infrastructure
is not read, for now we're maintaining a specific check against this
at the higher level of the config validity check.
The patch is a bit large because thread resolution is performed in
multiple steps, so we need to adjust all of them at once to preserve
functional and technical consistency.
The purpose is to be able to store large thread sets, defined by ranges
that may cross group boundaries, as well as define lists of groups and
masks. The thread_set struct implements the storage, and the parser is
in parse_thread_set(), with a focus on "bind" lines, but not only.
During 2.5 development, a fallback was implemented for bind "thread"
directives that would not map to existing threads, with commit e3f4d7496
("MEDIUM: config: resolve relative threads on bind lines to absolute ones").
The approach consisted in remapping the threads to other ones. But now
that relative threads and not absolute threads are stored in this mask,
this case cannot happen anymore, and this confusing hack is not needed
anymore.
This flag is only used to tag a QUIC listener, which we now know by
its bind_conf's xprt as well. It's only used to decide whether or not
to perform an extra initialization step on the listener. Let's drop it
as well as the flags field.
With the various fields and options moved, the listener struct reduced
by 48 bytes total.
LI_O_TCP_L4_RULES and LI_O_TCP_L5_RULES are only set from the proxy
based on the presence or absence of tcp_req l4/l5 rules. It's basically
as cheap to check the list as it is to check the flag, except that there
is no need to maintain a copy. Let's get rid of them, and this may ease
addition of more dynamic stuff later.
These two flags are entirely for internal use and are even per proxy
in practice since they're used for peers and CLI to indicate (for the
first one) that the listener(s) are not subject to connection limits,
and for the second that the listener(s) should not be stopped on
soft-stop. No need to keep them in the listeners, let's move them to
the bind_conf under names BC_O_UNLIMITED and BC_O_NOSTOP.
These are only set per bind line and used when creating a sessions,
we can move them to the bind_conf under the names BC_O_ACC_PROXY and
BC_O_ACC_CIP respectively.
It's set per bind line ("tfo") and only used in tcp_bind_listener() so
there's no point keeping the address family tests, let's just store the
flag in the bind_conf under the name BC_O_TCP_FO.
This option is set per bind line, and was only stored when the
address family is AF_INET or AF_INET6. That's pointless since it's
used only in tcp_bind_listener() which is only used for such families
as well, so it can now be moved to the bind_conf under the name
BC_O_DEF_ACCEPT.
It's currently declared per-frontend, though it would make sense to
support it per-line but in no case per-listener. Let's move the option
to a bind_conf option BC_O_NOLINGER.
This field is used by stream_new() to optionally set the applet the
stream will connect to for simple proxies like the CLI for example.
But it has never been configurable to anything and is always strictly
equal to the frontend's ->default_target. Let's just drop it and make
stream_new() only use the frontend's. It makes more sense anyway as
we don't want the proxy to work differently based on the "bind" line.
This idea was brought in 1.6 hoping that the h2 implementation would
use applets for decoding (which was dropped after the very first
attempt in 1.8).
The accept callback directly derives from the upper layer, generally
it's session_accept_fd(). As such it's also defined per bind line
so it makes sense to move it there.
The maxconn is set per bind line so let's move it there. This might
possibly even slightly reduce inter-thread contention since this one
is read-mostly and it was stored next to nbconn which changes for
each connection setup or teardown.
Like for previous values, maxaccept is really per-bind_conf, so let's
move it there. Some frontends (peers, log) set it to 1 so the assignment
was slightly moved.
These two arguments were only set and only used with tcpv4/tcpv6. Let's
just store them into the bind_conf instead of duplicating them for all
listeners since they're fixed per "bind" line.
When bind_conf were created, some elements such as the analysers mask
ought to have moved there but that wasn't the case. Now that it's
getting clearer that bind_conf provides all binding parameters and
the listener is essentially a listener on an address, it's starting
to get really confusing to keep such parameters in the listener, so
let's move the mask to the bind_conf. We also take this opportunity
for pre-setting the mask to the frontend's upon initialization. Now
several loops have one less argument to take care of.
The SCID (source connection ID) used by a peer (client or server) is sent into the
long header of a QUIC packet in clear. But it is also sent into the transport
parameters (initial_source_connection_id). As the latter are encrypted in the
packet, one must check that these two pieces of information do not differ
due to a packet header corruption. Furthermore, as such a connection is unusable,
it must be killed and must stop processing RX/TX packets as soon as possible.
Implement qc_kill_con() to flag a connection as unusable and to kill it asap,
waking up the idle timer task to release the connection.
Add a check to quic_transport_params_store() to detect that the SCIDs do not
match and make it call qc_kill_con().
Add several tests about connections to be killed at several critical locations,
especially in the TLS stack callbacks used to receive CRYPTO data or derive secrets,
and before preparing packets after having received others.
Must be backported to 2.6 and 2.7.
It is a bad idea to make the TLS ClientHello callback call qc_conn_finalize().
If the latter fails, this would generate a TLS alert and make the connection
send packets whereas it is not functional. But qc_conn_finalize()'s job was to
install the transport parameters sent by the QUIC listener. This installation
cannot be done at any time. This must be done after having possibly negotiated
the QUIC version and before sending the first Handshake packets. The best
moment to do that seems to be when the Handshake TX secrets are derived. This
has been found by inspecting the ngtcp2 code. Calling SSL_set_quic_transport_params()
too late would make the ServerHello be sent without the transport parameters.
The code for the connection update which was done from qc_conn_finalize() has
been moved to quic_transport_params_store(). So, this update is done as soon as
possible.
Add QUIC_FL_CONN_TX_TP_RECEIVED to flag the connection as having received the
peer transport parameters. Indeed this is required when the ClientHello message
is split across several packets.
Add QUIC_FL_CONN_FINALIZED to protect the connection from calling qc_conn_finalize()
more than one time. The latter is called only when the connection has received
the transport parameters and after returning from SSL_do_handshake(), which is
the function that triggers the TLS ClientHello callback.
Remove the calls to qc_conn_finalize() from the TLS ClientHello callbacks.
Must be backported to 2.6 and 2.7.
This bug was revealed by some C1 interop tests (heavy handshake packet
corruption) when receiving 1-RTT packets with a key phase update.
This led the packet to be decrypted with the next key phase secrets.
But this latter is initialized only after the handshake is complete.
In fact, 1-RTT must never be processed before the handshake is complete.
Relying on the "qc->mux_state == QC_MUX_NULL" condition to check the
handshake is complete is wrong during 0-RTT sessions when the mux
is initialized before the handshake is complete.
Must be backported to 2.7 and 2.6.
This is not really a bug fix but an improvement. When the Handshake packet number
space has been detected as needed to be probed, we should also try to probe the
Initial packet number space if there are still packets in flight. Furthermore
we should also try to send up to two datagrams.
Must be backported to 2.6 and 2.7.
This function is called only when probing a single packet number space
(Handshake) or the same one twice (Application). So, there is no risk of
preparing the same frame twice needlessly because we wanted to probe two
packet number spaces. The condition "ignore the packets which have been
coalesced to another one" is not necessary. More importantly, the bug shows
up when we want to prepare an Application packet which has been coalesced
to a Handshake packet. This is always the case when the first Application
packet is sent: it is always coalesced to a Handshake packet with an ACK
frame. So, when lost, this first Application packet was never resent. It
contains the HANDSHAKE_DONE frame to confirm the completion of the
handshake to the client.
Must be backported to 2.6 and 2.7.
During the handshake, and as long as the handshake has not been confirmed,
the acknowledgement delays reported by the peer may be larger
than max_ack_delay. max_ack_delay SHOULD be ignored before the
handshake is confirmed when computing the PTO. But the current code considered
the wrong condition, "before the handshake is completed".
Replace the enum value QUIC_HS_ST_COMPLETED by QUIC_HS_ST_CONFIRMED to
fix this issue. In quic_loss.c, the parameter passed to quic_pto_pktns()
is renamed to avoid any possible confusion.
Must be backported to 2.7 and 2.6.
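A hedged sketch in the spirit of RFC 9002's PTO formula (not HAProxy's exact
code), showing where max_ack_delay contributes once the handshake is
confirmed:

  static unsigned int compute_pto(unsigned int srtt, unsigned int rtt_var,
                                  unsigned int granularity,
                                  unsigned int max_ack_delay,
                                  int handshake_confirmed)
  {
      unsigned int var = 4 * rtt_var > granularity ? 4 * rtt_var : granularity;
      unsigned int pto = srtt + var;

      if (handshake_confirmed)
          pto += max_ack_delay;   /* ignored while the handshake is unconfirmed */
      return pto;
  }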
This may happen during retransmission of frames which can be split
(CRYPTO or STREAM frames). One may have to split a frame to be
retransmitted due to the QUIC protocol properties (packet size limitation
and packet field encoding sizes). The remaining part of a frame which
cannot be retransmitted must be detached from the original frame it is
copied from. If not, when the part that was really sent is acknowledged,
the remaining part will be acknowledged too even though it was never sent!
Must be backported to 2.7 and 2.6.
Add cum_sess_ver[], a new array of counters, to count the number of cumulated
HTTP sessions by version (h1, h2 or h3).
Implement proxy_inc_fe_cum_sess_ver_ctr() to increment these counters.
This function is called each time an HTTP mux is correctly initialized. The
QUIC layer must first verify that the mux's application operations are for h3
before calling proxy_inc_fe_cum_sess_ver_ctr().
The ST_F_SESS_OTHER stat field, for the cumulated number of sessions other
than HTTP sessions, is deduced from the ->cum_sess_ver counter (for all
sessions, not only HTTP ones) from which the HTTP session counters are
subtracted.
Add cum_req[], a new array of counters, to count the number of cumulated HTTP
requests by version as well as non-HTTP requests. This new member replaces
->cum_req.
Modify proxy_inc_fe_req_ctr(), which increments these counters, to take an
HTTP version, the special value 0 meaning "other than an HTTP request". This
is the case for instance for syslog.c, from which proxy_inc_fe_req_ctr() is
called with 0 as version parameter.
The computation of the ST_F_REQ_TOT stat field for the cumulated number of
requests is modified to count the sum of all the cum_req[] counters.
As this patch is useful for QUIC, it must be backported to 2.7.
On high traffic benchmarks, it's visible that the CPU is dominated by
calls to memcpy(), and many of those come from htx functions. It was
measured that 63% of those coming from htx are made on 8-byte blocks
which really are not worth a call to the function since a single
read-write cycle does it fine.
This commit adds an inline htx_memcpy() function that explicitly
checks for this length and just copies the data without that call.
It's even likely that it could be detected on const sizes, though
that was not done. This is already effective in reducing the number
of calls to memcpy().
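The idea can be sketched as follows (illustrative only, not the exact
htx_memcpy() implementation):

  #include <stdint.h>
  #include <string.h>

  static inline void small_memcpy(void *dst, const void *src, size_t len)
  {
      if (len == 8) {
          /* constant-size copies are turned into plain loads/stores by the
           * compiler, so the dominant 8-byte case avoids a real memcpy() call
           */
          uint64_t v;
          memcpy(&v, src, 8);
          memcpy(dst, &v, 8);
      }
      else
          memcpy(dst, src, len);
  }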
Define a new configuration option "tune.quic.max-frame-loss". This is
used to specify the limit for which a single frame instance can be
detected as lost. If exceeded, the connection is closed.
This should be backported up to 2.7.
Add a <loss_count> new field in quic_frame structure. This field is set
to 0 and incremented each time a sent packet is declared lost. If
<loss_count> reached a hard-coded limit, the connection is deemed as
failing and is closed immediately with a CONNECTION_CLOSE using
INTERNAL_ERROR.
By default, limit is set to 10. This should ensure that overall memory
usage is limited if a peer behaves incorrectly.
This should be backported up to 2.7.
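A hedged sketch of the mechanism (hypothetical names, not the real quic_conn
code): each time a packet carrying the frame is declared lost, the counter is
bumped; past the limit the caller closes the connection with INTERNAL_ERROR.

  struct frame_ctr { unsigned int loss_count; };

  static int on_frame_lost(struct frame_ctr *frm, unsigned int max_frame_loss)
  {
      if (++frm->loss_count >= max_frame_loss)
          return -1;   /* emit CONNECTION_CLOSE with INTERNAL_ERROR */
      return 0;        /* frame may be requeued for retransmission */
  }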
Define a new function qc_frm_free() to handle frame deallocation. New
BUG_ON() statements ensure that the deallocated frame is not referenced
by another frame. To support this, all LIST_DELETE() have been replaced by
LIST_DEL_INIT(). This should enforce that frame deallocation is robust.
As a complement, qc_frm_unref() has been moved into quic_frame module.
It is justified as this is a utility function related to frame
deallocation. It allows to use it in quic_pktns_tx_pkts_release() before
calling qc_frm_free().
This should be backported up to 2.7.
Define two utility functions for quic_frame allocation :
* qc_frm_alloc() is used to allocate a new frame
* qc_frm_dup() is used to allocate a new frame by duplicating an
existing one
These functions are useful to centralize quic_frame initialization.
Note that pool_zalloc() is replaced by a proper pool_alloc() + explicit
initialization code.
This commit will simplify implementation of the per frame retransmission
limitation. Indeed, a new counter will be added in quic_frame structure
which must be initialized to 0.
This should be backported up to 2.7.
Care must be taken when reading/writing offset for STREAM frames. A
special OFF bit is set in the frame type to indicate that the field is
present. If not set, it is assumed that offset is 0.
To represent this, the offset field of the quic_stream structure must always
be initialized with a value consistent with its frame type OFF bit.
The previous code has no bug, in part because pool_zalloc() is used to
allocate quic_frame instances. To be able to use pool_alloc(), the offset is
now always explicitly set to 0. If a non-null value is used, the OFF bit is
set at the same occasion. A new BUG_ON() statement is added to the frame
builder to ensure that the caller has set the OFF bit if the offset is
non-null.
This should be backported up to 2.7.
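As an illustration of the rule (simplified types and an RFC 9000 constant,
not the actual frame builder):

  #define STREAM_TYPE_OFF_BIT 0x04u   /* OFF bit of the STREAM frame type */

  struct stream_frm {
      unsigned char type;
      unsigned long long offset;
  };

  static void set_stream_offset(struct stream_frm *frm, unsigned long long off)
  {
      frm->offset = off;   /* always initialized, even when 0 */
      if (off)
          frm->type |= STREAM_TYPE_OFF_BIT;   /* OFF bit only for non-null offsets */
  }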
A dedicated <fin> field was used in quic_stream structure. However, this
info is already encoded in the frame type field as specified by QUIC
protocol.
In fact, only code for packet reception used the <fin> field. On the
sending side, we only checked for the FIN bit. To align both sides,
remove the <fin> field and only use the FIN bit.
This should be backported up to 2.7.
In an attempt to fix GH #1873, ("BUG/MEDIUM: stats: Rely on a local trash
buffer to dump the stats") explicitly reduced output buffer size to leave
enough space for htx overhead under http context.
Github user debtsandbooze, who first reported the issue, came back to us
and said he was still able to make the http dump "hang" with the new fix.
After some tests, it became clear that htx_add_data_atonce() could fail from
time to time in stats_putchk(), even if htx was completely empty:
In http context, buffer size is maxed out at channel_htx_recv_limit().
Unfortunately, channel_htx_recv_limit() is not what we're looking for here
because this limit doesn't account for the proper htx overhead.
Using buf_room_for_htx_data() instead of channel_htx_recv_limit() to compute
max "usable" data space seems to be the last piece of work required for
the previous fix to work properly.
This should be backported everywhere the aforementioned commit is.
idle and harmless bits in the tgroup_ctx structure were not explicitly
set during boot.
| struct tgroup_ctx ha_tgroup_ctx[MAX_TGROUPS] = { };
As the structure is first statically initialized,
.threads_harmless and .threads_idle are automatically zero-
initialized by the compiler.
Unfortunately, this means that such threads are not considered idle
nor harmless by the thread_isolate(_full)() functions until they enter
the polling loop (thread_harmless_now() and thread_idle_now() are
respectively called before entering the polling loop).
Because of this, any attempt to call thread_isolate() or thread_isolate_full()
during a startup phase with nbthreads >= 2 will cause thread_isolate() to
loop until every secondary thread makes it through its first polling loop.
If the startup phase is aborted during boot (ie: "-c" option to check the
configuration), secondary threads may be initialized but will never be started
(ie: they won't enter the polling loop), thus thread_isolate()
would loop forever in such cases.
We can easily reveal the bug with this patch reproducer:
| diff --git a/src/haproxy.c b/src/haproxy.c
| index e91691658..0b733f6ee 100644
| --- a/src/haproxy.c
| +++ b/src/haproxy.c
| @@ -2317,6 +2317,10 @@ static void init(int argc, char **argv)
| if (pr || px) {
| /* At least one peer or one listener has been found */
| qfprintf(stdout, "Configuration file is valid\n");
| + printf("haproxy will loop...\n");
| + thread_isolate();
| + printf("we will never reach this\n");
| + thread_release();
| deinit_and_exit(0);
| }
| qfprintf(stdout, "Configuration file has no error but will not start (no listener) => exit(2).\n");
Now we start haproxy with a valid config:
$> haproxy -c -f valid.conf
Configuration file is valid
haproxy will loop...
^C
------------------------------------------------------------------------------
This did not cause any issue so far because no early deinit paths require
full thread isolation. But this may change when new features or requirements
are introduced, so we should fix this before it becomes a real issue.
To fix this, we explicitly assign .threads_harmless and .threads_idle
to .threads_enabled value in thread_map_to_groups() function during boot.
This is the proper place to do this since, as long as .threads_enabled is not
explicitly set, its default value is also 0 (zero-initialized by the compiler).
Code snippet from the thread_isolate() function:
ulong te = _HA_ATOMIC_LOAD(&ha_tgroup_info[tgrp].threads_enabled);
ulong th = _HA_ATOMIC_LOAD(&ha_tgroup_ctx[tgrp].threads_harmless);
if ((th & te) == te)
break;
Thus thread_isolate(_full)() won't loop forever even if it were to be used
before thread_map_to_groups() is executed.
No backport needed unless this is a requirement.
This commit is identical to the preceding patch. However, these traces
are from another patch with a different backport scope:
56a86ddfb9
MINOR: h3: add missing traces on closure
This must be backported up to 2.7 where above patch is scheduled.
First H3 traces argument must be a connection instance or a NULL. Some
new traces were added recently with a qcc instance which caused a crash
when traces are activated.
This trace was added by the following patch :
87f8766d3f
BUG/MEDIUM: h3: handle STOP_SENDING on control stream
This must be backported up to 2.6 along with the above patch.
When using WolfSSL, there are some cases where the SSL_CTX_sess_new_cb is
called with an existing session ID. These cases are not met with
OpenSSL.
When the ID is found in the session tree during the insertion, the
shared_block len is not set to 0 and is not used. However if later the
block is reused, since the len is not set to 0, the release callback
will be called and an ebmb_delete will be attempted on the block, even if it's
not in the tree, provoking a crash.
The code was buggy from the beginning, but the case never happens with
OpenSSL, which changes the ID.
Must be backported to every maintained branch.
Add traces for function h3_shutdown() / h3_send_goaway(). This should
help to debug problems related to connection closure.
This should be backported up to 2.7.
This commit is similar to the previous one. It reports an error if a
RESET_STREAM is received for the remote control stream. This will
generate a CONNECTION_CLOSE with H3_CLOSED_CRITICAL_STREAM error.
Note that contrary to the previous bug related to STOP_SENDING, this bug
was not encountered in real environment. As such, it is labelled as
MINOR. However, it could trigger the same crash as the previous patch.
This should be backported up to 2.6.
Before this patch, STOP_SENDING reception was considered valid even on
H3 control stream. This causes the emission in return of RESET_STREAM
and eventually the closure and freeing of the QCS instance. This then
causes a crash during connection closure as a GOAWAY frame is emitted on
the control stream which is now released.
To fix this crash, STOP_SENDING on the control stream is now properly
rejected as specified by RFC 9114. The new app_ops close callback is
used which in turn will generate a CONNECTION_CLOSE with error
H3_CLOSED_CRITICAL_STREAM.
This bug was detected in github issue #2006. Note however that it is
triggered by an incorrect client behavior. It may be useful to determine
which client behaves like this. If this case is too frequent,
STOP_SENDING should probably be silently ignored.
To reproduce this issue, quiche was patched to emit a STOP_SENDING on
its send() function in quiche/src/lib.rs:
pub fn send(&mut self, out: &mut [u8]) -> Result<(usize, SendInfo)> {
- self.send_on_path(out, None, None)
+ let ret = self.send_on_path(out, None, None);
+ self.streams.mark_stopped(3, true, 0);
+ ret
}
This must be backported up to 2.6 along with the preceding commit:
MINOR: mux-quic/h3: define close callback
Define a new qcc_app_ops callback named close(). This will be used to
notify app-layer about the closure of a stream by the remote peer. Its
main usage is to ensure that the closure is allowed by the application
protocol specification.
For the moment, close is not implemented by H3 layer. However, this
function will be mandatory to properly reject a STOP_SENDING on the
control stream and prevent a later crash. As such, this commit must
be backported with the next one on 2.6.
This is related to github issue #2006.
h3_resp_trailers_send() may be called due to an HTX EOT block present
without a preceding HTX TRAILER block. In this case, no HEADERS frame
will be generated by H3 layer and MUX will emit an empty STREAM frame
with FIN set.
However, before skipping these, some operations are conducted on qcs
buffer to realign it and try to encode the QPACK field section line in a
buffer copy. These operations are thus unneeded if no trailer is
generated. Even worse, the function will fail if there is not enough
space in the buffer for the superfluous QPACK section line.
To improve this situation, this patch adds an early goto statement to
skip most operations in h3_resp_trailers_send() if no HTX trailer block
is found.
This patch is related to github issue #2006.
This should be backported up to 2.7.
Replace ABORT_NOW() by proper error management in
h3_resp_trailers_send() for QPACK encoding operation.
If a QPACK encoding operation fails, it means there is not enough space
in qcs buffer. In this case, flag qcs instance with QC_SF_BLK_MROOM and
return an error. The MUX is responsible for removing this flag once buffer
space is available.
This should fix the crash reported by gabrieltz on github issue #2006.
This must be backported up to 2.7.
In http_build_7239_header_nodename(), ip6 address dumping is performed
at a single place to prevent code duplication:
A goto statement combined with a local pointer variable (ip6_addr) was used
to perform the ipv6 dump from different calling places inside the function.
However, when the goto was performed (ie: sample expression handling), the
ip6_addr pointer was assigned the address of a limited-scope variable that is
not valid within the dumping code.
Because of this, we have an undefined behavior that could result in a bug or
a crash depending on the platform that is running haproxy.
This was found by Coverity (GH #2018).
To fix this, we add a simple ip6 printing helper that takes the ip6_addr
pointer as an argument.
This prevents any scope related bug as the function is executed under the
proper context.
if/else guards inside the function were reviewed to make sure that the goto
removal won't affect existing behavior.
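For illustration, a minimal sketch of what such a helper may look like (the
helper name and exact output format are assumptions; chunk_appendf() and
struct buffer are existing haproxy facilities):
  /* dump the ipv6 address pointed to by <ip6_addr> into <out>; the callers
   * pass a variable that is still in scope, unlike with the former goto */
  static inline void http_build_7239_header_ip6(struct buffer *out,
                                                const struct in6_addr *ip6_addr)
  {
          char str[INET6_ADDRSTRLEN];

          if (inet_ntop(AF_INET6, ip6_addr, str, sizeof(str)))
                  chunk_appendf(out, "\"[%s]\"", str);
  }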
----------
No backport needed, except if the commit ("MINOR: proxy/http_ext: introduce
proxy forwarded option") is backported.
Given that this commit needs to be backported with
"MINOR: proxy/http_ext: introduce proxy forwarded option", We're using it as a
reminder for another bug that was introduced with
"MINOR: proxy/http_ext: introduce proxy forwarded option" but has been silently
fixed since with
"MEDIUM: proxy/http_ext: implement dynamic http_ext".
If "MINOR: proxy/http_ext: introduce proxy forwarded option" needs to be
backported without
"MEDIUM: proxy/http_ext: implement dynamic http_ext", you should manually apply
the following patch on top of it:
| diff --git a/src/http_ext.c b/src/http_ext.c
| index fcb5a07bc..3921357a3 100644
| --- a/src/http_ext.c
| +++ b/src/http_ext.c
| @@ -609,7 +609,7 @@ static inline void http_build_7239_header_node(struct buffer *out,
| if (forby->np_mode)
| chunk_appendf(out, "\"");
| offset_save = out->data;
| - http_build_7239_header_node(out, s, curproxy, addr, &curproxy->http.fwd.p_by);
| + http_build_7239_header_nodename(out, s, curproxy, addr, forby);
| if (offset_save == out->data) {
| /* could not build nodename, either because some
| * data is not available or user is providing bad input
| @@ -619,7 +619,7 @@ static inline void http_build_7239_header_node(struct buffer *out,
| if (forby->np_mode) {
| chunk_appendf(out, ":");
| offset_save = out->data;
| - http_build_7239_header_nodeport(out, s, curproxy, addr, &curproxy->http.fwd.p_by);
| + http_build_7239_header_nodeport(out, s, curproxy, addr, forby);
| if (offset_save == out->data) {
| /* could not build nodeport, either because some data is
| * not available or user is providing bad input
(If you don't, forwarded option won't work properly and will crash
haproxy (stack overflow) when building 'for' or 'by' parameter)
As reported by Coverity, this function may be called with no h2c. Thus, the
pointer must always be checked before any access. One test was missing in
TRACE_PRINTF_LOC().
This patch should fix the issue #2015. No backport needed, except if the
commit 11e8a8c2a ("MEDIUM: mux-h2/trace: add tracing support for headers")
is backported.
Despite the doc saying that 'use-fcgi-app' keyword may only be used in backend
or listen section, we forgot to prevent its usage in default section.
This is wrong because fcgi relies on a filter, and filters cannot be defined in
a default section.
Making sure such usage reports an error to the user and complies with the doc.
This could be backported up to 2.2.
This is a simple refactor to remove specific http_ext post-parsing treatment from
cfgparse.
Related work is now performed internally through check_http_ext_postconf()
function that is registered via REGISTER_POST_PROXY_CHECK() in http_ext.c.
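For illustration, a minimal sketch of the registration pattern (the callback
body is illustrative only; the function name and REGISTER_POST_PROXY_CHECK()
come from the commit message):
  /* called once for each proxy after the whole configuration is parsed */
  static int check_http_ext_postconf(struct proxy *px)
  {
          /* compile sample expressions, validate http_ext options, ... */
          return ERR_NONE;
  }
  REGISTER_POST_PROXY_CHECK(check_http_ext_postconf);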
proxy http-only options implemented in http_ext were statically stored
within proxy struct.
We're making some changes so that http_ext options are now stored in
dynamically allocated structs.
http_ext related structs are only allocated when needed to save some space
whenever possible, and they are automatically freed upon proxy deletion.
Related PX_O_HTTP{7239,XFF,XOT} option flags were removed because we're now
considering an http_ext option as 'active' if it is allocated (ptr is not NULL).
A few checks (and BUG_ON) were added to make these changes safe because
it adds some (acceptable) complexity to the previous design.
Also, proxy.http was renamed to proxy.http_ext to make things more explicit.
http option forwarded (rfc7239) supports sample expressions when
configuring 'host', 'for' and 'by' parameters.
However, since we are in a http-backend-only context, right after http
header is processed, we have a limited resolution scope for the sample
expression provided by the user.
To prevent any confusion, a warning is emitted when parsing the option
if the user relies on a sample expression (more precisely a fetch) which
would yield unexpected results at runtime when processing the option.
forwarded header option (rfc7239) deals with sample expressions in two
steps: first a sample expression string is extracted from the config file
and later in startup sequence this string is converted into the resulting
sample_expr.
We need to perform these two steps because we cannot compile the expr
too early in the parsing sequence. (or we would miss some context)
Because of this, we have two distinct structure members (expr and expr_s)
for each 7239 field supporting sample expressions.
This is not cool, because we're bloating the http forwarded config structure,
and thus, bloating proxy config structure.
To address this, we now merge both expr and expr_s members inside a single
union to regain some space. This forces us to perform some additional logic
to make sure to use the proper structure member at different parsing steps.
Thanks to this, we're also able to free/release related config hints and
sample expression strings as soon as the sample expression
compilation is finished.
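A minimal sketch of the idea (structure and member names are assumptions, not
the actual http_ext layout):
  struct http_fwd_param {
          unsigned int mode;                   /* tells which union member is valid */
          union {
                  char *expr_s;                /* expression string, config parsing stage */
                  struct sample_expr *expr;    /* compiled expression, runtime stage */
          };
  };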
Adding new http converter: rfc7239_n2np.
Takes a string representing 7239 forwarded header node (extracted from
either 'for' or 'by' 7239 header fields) as input and translates it
to either an unsigned integer or a '_' prefixed obfuscated identifier,
according to RFC 7239.
Example:
# extract 'by' field from forwarded header, extract node port from
# resulting node identifier and store the result in req.fnp
http-request set-var(req.fnp) req.hdr(forwarded),rfc7239_field(by),rfc7239_n2np
#input: "by=\"127.0.0.1:9999\""
# output: 9999
#input: "by=\"_name:_port\""
# output: "_port"
Depends on:
- "MINOR: http_ext: introduce http ext converters"
Adding new http converter: rfc7239_n2nn.
Takes a string representing 7239 forwarded header node (extracted from
either 'for' or 'by' 7239 header fields) as input and translates it
to either ipv4 address, ipv6 address or str ('_' prefixed if obfuscated
or "unknown" if unknown), according to 7239RFC.
Example:
# extract 'for' field from forwarded header, extract nodename from
# resulting node identifier and store the result in req.fnn
http-request set-var(req.fnn) req.hdr(forwarded),rfc7239_field(for),rfc7239_n2nn
#input: "for=\"127.0.0.1:9999\""
# output: 127.0.0.1
#input: "for=\"_name:_port\""
# output: "_name"
Depends on:
- "MINOR: http_ext: introduce http ext converters"
Adding new http converter: rfc7239_field.
Takes a string representing 7239 forwarded header single value as
input and extracts a single field/parameter from the header according
to user selection.
Example:
# extract host field from forwarded header and store it in req.fhost var
http-request set-var(req.fhost) req.hdr(forwarded),rfc7239_field(host)
#input: "proto=https;host=\"haproxy.org:80\""
# output: "haproxy.org:80"
# extract for field from forwarded header and store it in req.ffor var
http-request set-var(req.ffor) req.hdr(forwarded),rfc7239_field(for)
#input: "proto=https;host=\"haproxy.org:80\";for=\"127.0.0.1:9999\""
# output: "127.0.0.1:9999"
Depends on:
- "MINOR: http_ext: introduce http ext converters"
Adding new http converter: rfc7239_is_valid.
Takes a string representing 7239 forwarded header single value as
input and returns bool:TRUE if header is RFC compliant and
bool:FALSE otherwise.
Example:
acl valid req.hdr(forwarded),rfc7239_is_valid
#input: "for=127.0.0.1;proto=http"
# output: TRUE
#input: "proto=custom"
# output: FALSE
Depends on:
- "MINOR: http_ext: introduce http ext converters"
This commit is really simple, it adds the required skeleton code to allow
new http_ext converter to be easily registered through STG_REGISTER facility.
Just like forwarded (7239) header and forwardfor header, move parsing,
logic and management of 'originalto' option into http_ext dedicated class.
We're only doing this to standardize proxy http options management.
Existing behavior remains untouched.
Just like forwarded (7239) header, move parsing, logic and management
of 'forwardfor' option into http_ext dedicated class.
We're only doing this to standardize proxy http options management.
Existing behavior remains untouched.
Introducing http_ext class for http extension related work that
doesn't fit into existing http classes.
HTTP extension "forwarded", introduced with 7239 RFC is now supported
by haproxy.
The option supports various modes from simple to complex usages involving
custom sample expressions.
Examples :
# Those servers want the ip address and protocol of the client request
# Resulting header would look like this:
# forwarded: proto=http;for=127.0.0.1
backend www_default
mode http
option forwarded
#equivalent to: option forwarded proto for
# Those servers want the requested host and hashed client ip address
# as well as client source port (you should use seed for xxh32 if ensuring
# ip privacy is a concern)
# Resulting header would look like this:
# forwarded: host="haproxy.org";for="_000000007F2F367E:60138"
backend www_host
mode http
option forwarded host for-expr src,xxh32,hex for_port
# Those servers want custom data in host, for and by parameters
# Resulting header would look like this:
# forwarded: host="host.com";by=_haproxy;for="[::1]:10"
backend www_custom
mode http
option forwarded host-expr str(host.com) by-expr str(_haproxy) for for_port-expr int(10)
# Those servers want random 'for' obfuscated identifiers for request
# tracing purposes while protecting sensitive IP information
# Resulting header would look like this:
# forwarded: for=_000000002B1F4D63
backend www_for_hide
mode http
option forwarded for-expr rand,hex
By default (no argument provided), the forwarded option will try to mimic
common x-forwarded-for setups (source client ip address + source protocol).
The option is not available for frontends.
"no option forwarded" is supported.
More info about 7239 RFC here: https://www.rfc-editor.org/rfc/rfc7239.html
More info about the feature in doc/configuration.txt
This should address feature request GH #575
Depends on:
- "MINOR: http_htx: add http_append_header() to append value to header"
- "MINOR: sample: add ARGC_OPT"
- "MINOR: proxy: introduce http only options"
Just like http_append_header(), but this time to insert new value before
an existing one.
If the header already contains one or multiple values, ',' is automatically
inserted after the new value.
Call this function as an alternative to http_replace_header_value()
to append a new value to an existing header instead of replacing the whole
header content.
If the header already contains one or multiple values: a ',' is automatically
appended before the new value.
This function is not meant for prepending (providing empty ctx value),
in which case we should consider implementing dedicated prepend
alternative function.
The functions in charge of processing headers have their names in the
traces and they're among the longest of the mux_h2.c file, while even
containing some redundancy. These names are not used outside, let's
shorten them:
- h2c_decode_headers -> h2c_dec_hdrs
- h2s_bck_make_req_headers -> h2s_snd_bhdrs
- h2s_frt_make_resp_headers -> h2s_snd_fhdrs
Now the traces are a bit more readable:
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh :method: GET
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh :path: /
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh :scheme: http
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh :authority: localhost:14446
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh user-agent: curl/7.54.1
[00|h2|5|mux_h2.c:4822] h2c_dec_hdrs(): h2c=0x1870510(F,FRP) dsi=1 rcvh accept: */*
Now we can make use of TRACE_PRINTF() to iterate over headers as they
are received or dumped. It's worth noting that the dumps may occasionally
be interrupted due to a buffer full or a realign, but in this case it
will be visible because the trace will restart from the first one. All
these headers (and trailers) may be interleaved with other connections'
so they're all preceded by the pointer to the connection and optionally
the stream (or alternately the stream ID) to help discriminating them.
Since it's not easy to read the header directions, sent headers are
prefixed with "sndh" and received headers are prefixed with "rcvh", both
of which are rare enough in the traces to conveniently support a quick
grep.
In order to avoid code duplication, h2_encode_headers() was implemented
as a wrapper on top of hpack_encode_header(), which optionally emits the
header to the trace if the trace is active. In addition, for headers that
are encoded using a different method, h2_trace_header() was added as well.
Header names are truncated to 256 bytes and values to 1024 bytes. If
the lengths are larger, they will be truncated and suffixed with
"(... +xxx)" where "xxx" is the number of extra bytes.
Example of what an end-to-end H2 request gives:
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh :method: GET
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh :path: /
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh :scheme: http
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh :authority: localhost:14446
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh user-agent: curl/7.54.1
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh accept: */*
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c13120(F,FRP) dsi=1 rcvh cookie: blah
[00|h2|5|mux_h2.c:5491] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh :method: GET
[00|h2|5|mux_h2.c:5572] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh :authority: localhost:14446
[00|h2|5|mux_h2.c:5596] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh :path: /
[00|h2|5|mux_h2.c:5647] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh user-agent: curl/7.54.1
[00|h2|5|mux_h2.c:5647] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh accept: */*
[00|h2|5|mux_h2.c:5647] h2s_bck_make_req_headers(): h2c=0x1c1cd90(B,FRH) h2s=0x1c1e3d0(1,IDL) sndh cookie: blah
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c1cd90(B,FRP) dsi=1 rcvh :status: 200
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c1cd90(B,FRP) dsi=1 rcvh content-length: 0
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c1cd90(B,FRP) dsi=1 rcvh x-req: size=102, time=0 ms
[00|h2|5|mux_h2.c:4818] h2c_decode_headers(): h2c=0x1c1cd90(B,FRP) dsi=1 rcvh x-rsp: id=dummy, code=200, cache=1, size=0, time=0 ms (0 real)
[00|h2|5|mux_h2.c:5210] h2s_frt_make_resp_headers(): h2c=0x1c13120(F,FRH) h2s=0x1c1c780(1,HCR) sndh :status: 200
[00|h2|5|mux_h2.c:5231] h2s_frt_make_resp_headers(): h2c=0x1c13120(F,FRH) h2s=0x1c1c780(1,HCR) sndh content-length: 0
[00|h2|5|mux_h2.c:5231] h2s_frt_make_resp_headers(): h2c=0x1c13120(F,FRH) h2s=0x1c1c780(1,HCR) sndh x-req: size=102, time=0 ms
[00|h2|5|mux_h2.c:5231] h2s_frt_make_resp_headers(): h2c=0x1c13120(F,FRH) h2s=0x1c1c780(1,HCR) sndh x-rsp: id=dummy, code=200, cache=1, size=0, time=0 ms (0 real)
At some point the frontend/backend names would be useful but that's a more
general comment than just the H2 traces.
By default, passing a NULL cb to the trace functions will result in the
source's default one to be used. For some cases we won't want to use any
callback at all, not even the default one. Let's define a trace_no_cb()
function for this, that does absolutely nothing.
Sometimes it would be necessary to prepare some messages, pre-process some
blocks or maybe duplicate some contents before they vanish for the purpose
of tracing them. However we don't want to do that for everything that is
submitted to the traces, it's important to do it only for what will really
be traced.
The __trace() function has all the knowledge for this, to the point of even
checking the lockon pointers. This commit splits the function in two, one
with the trace decision logic, and the other one for the trace production.
The first one is now usable through wrappers such as _trace_enabled() and
TRACE_ENABLED() which will indicate whether traces are going to be produced
for the current source, level, event mask, parameters and tracking.
There are ifdefs at several places to only define TRC_ARGS_QCON when
QUIC is defined, but nothing prevents this code from building without.
Let's just remove those ifdefs, the single "if" they avoid is not worth
the extra maintenance burden.
Since 2.6 we have a free_logsrv() function that is used to release log
servers. It must be called from deinit() instead of manually iterating
over the log servers, otherwise some parts of the structure are not
freed (namely the ring name), as reported by ASAN.
This should be backported to 2.6.
Commit 9f4f6b038 ("OPTIM: hpack-huff: reduce the cache footprint of the
huffman decoder") replaced the large tables with more space efficient
byte arrays, but one table, rht_bit15_11_11_4, has a 64-byte hole in it
that wasn't materialized by filling it with zeroes to make the offsets
match, nor by adjusting the offset from the caller. This resulted in
some control chars not properly being decoded and being seen as byte 0,
and the associated messages to be rejected, as can be seen in issue #1971.
This commit fixes it by adjusting the offset used for the higher part of
the table so that we don't need to store 64 zeroes that will never be
accessed.
This needs to be backported to 2.7.
Thanks to Christopher for spotting the bug, and to Juanga Covas for
providing precious traces showing the problem.
A major regression was introduced by the following patch
commit 71fd03632f
MINOR: mux-quic/h3: send SETTINGS as soon as transport is ready
H3 finalize operation is now called at an early stage in the middle of
qc_init(). However, some qcc members are not yet initialized. In
particular the stream tree which will cause a crash when H3 control
stream will be accessed.
To fix this, qcc_install_app_ops() has been delayed to the end of
qc_init(). This ensures that the qcc is properly initialized when app_ops
operations are used.
This must be backported wherever above patch is. For the record, it has
been tagged up to 2.7.
Since the rework of QUIC streams send scheduling, each stream has to be
inserted in QUIC-mux send-list to be able to emit content. This was not
the case for GOAWAY, which prevented it from being sent. This regression has
been introduced by the following patch:
commit 20f2a425ff
MAJOR: mux-quic: rework stream sending priorization
This new patch fixes the issue by inserting H3 control stream in mux
send-list. The impact is deemed minor as for the moment GOAWAY is only
sent just before connection/mux cleanup with a CONNECTION_CLOSE.
However, it might cause some connections to hang up indefinitely.
This should be backported up to 2.7.
As specified by HTTP3 RFC, SETTINGS frame should be sent as soon as
possible. Before this patch, this was only done on the first qc_send()
invocation. This significantly delayed SETTINGS emission until the first
H3 response was ready to be transferred.
This patch fixes this by ensuring SETTINGS is emitted when MUX-QUIC is
being setup.
As a side point, return value of finalize operation is checked. This
means that an error during SETTINGS emission will cause the connection
init to fail.
This should be backported up to 2.7.
Add a BUG_ON() in conn_free(), to check that when we're freeing a
connection, it is not still in the idle connections tree, otherwise the
next thread that will try to use it will probably crash.
This patch fixes two leaks in the 'update ssl ocsp-response' cli
command. One rather significant one since a whole trash buffer was
allocated for every call of the command, and another more marginal one
in an error path.
This patch does not need to be backported.
The munmap() call performed on exit was incorrect since it used to apply
to the buffer instead of the area, so neither the pointer nor the size
were page-aligned. This patch corrects this and also adds a call to
msync() since munmap() alone doesn't guarantee that data will be dumped.
This should be backported to 2.6.
When receiving a QUIC datagram, destination address is retrieved via
recvmsg() and stored in quic-conn as qc.local_addr. This address is then
reused when using the quic-conn owned socket.
When listener socket mode is preferred, send operation did not specify
the source address of the emitted datagram. If listener socket is bound
on a wildcard address, the kernel is free to choose any address assigned
to the local machine. This may be different from the address selected by
the client on its first datagram, which will prevent the client from
emitting its next replies.
To address this, this patch fixes the UDP source address via sendmsg().
This process is similar to the reception and relies on ancillary
message, so the socket is left untouched after the operation. This is
heavily platform specific and may not be supported by some kernels.
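For illustration, a minimal sketch of the mechanism for the IPv4/Linux case
only (variable names are assumptions; the actual code also covers IPv6 and
other platforms):
  union {
          char buf[CMSG_SPACE(sizeof(struct in_pktinfo))];
          struct cmsghdr align;
  } u = { };
  struct msghdr msg = { };
  struct cmsghdr *cmsg;
  struct in_pktinfo *pktinfo;

  msg.msg_name       = &peer_addr;      /* destination of the datagram */
  msg.msg_namelen    = sizeof(peer_addr);
  msg.msg_iov        = &iov;            /* payload to emit */
  msg.msg_iovlen     = 1;
  msg.msg_control    = u.buf;
  msg.msg_controllen = sizeof(u.buf);

  cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = IPPROTO_IP;
  cmsg->cmsg_type  = IP_PKTINFO;
  cmsg->cmsg_len   = CMSG_LEN(sizeof(struct in_pktinfo));
  pktinfo = (struct in_pktinfo *)CMSG_DATA(cmsg);
  pktinfo->ipi_spec_dst = local_addr.sin_addr;  /* pin the source address */

  sendmsg(fd, &msg, 0);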
This change only has an impact if the listener socket alone is used for QUIC
communications. This is the default behavior on the 2.7 branch but not
anymore on 2.8. Set "tune.quic.socket-owner" to "listener" to enforce it.
This should be backported up to 2.7.
It is forbidden to request that h3 clients close their Control and QPACK
unidirectional streams. If we do, the client closes the connection with
H3_CLOSED_CRITICAL_STREAM(0x104). Perhaps this could prevent some clients such
as Chrome from coming back for a while.
But at the quic_conn level there is no means to identify the streams for which
we cannot send a STOP_SENDING frame. Such a possibility is not even mentioned
in RFC 9000.
At this time there is no other choice than to stop sending STOP_SENDING frames
for all the h3 unidirectional streams, by inspecting the quic_conn ->app_ops
value.
Must be backported to 2.7 and 2.6.
The wrong return value was checked, resulting in dead code and
potential bugs.
It should fix GitHub issue #2005.
This patch should be backported up to 2.5.
In case HPACK cannot be decoded, logs are emitted but there's no info
in the H2 traces, so let's add them.
This may be backported to all supported versions.
As reported by Dominik Froehlich in github issue #1968, some H2 request
parsing errors do not result in a log being emitted. This is annoying
for debugging because while an RST_STREAM is correctly emitted to the
client, there's no way without enabling traces to find it on the
haproxy side.
After some testing with various abnormal requests, a few places were
found where logs were missing and could be added. In this case, we
simply use sess_log() so some sample fetch functions might not be
available since the stream is not created. But at least there will
be a BADREQ in the logs. A good example of this consists in sending
forbidden headers or header syntax (e.g. presence of LF in value).
Some quick tests can be done this way:
- protocol error (LF in value):
curl -iv --http2-prior-knowledge -H "$(printf 'a:b\na')" http://0:8001/
- too large header block after decoding:
curl -v --http2-prior-knowledge -H "a:$(perl -e "print('a'x10000)")" -H "a:$(perl -e "print('a'x10000)")" http://localhost:8001/
This should be backported where needed, most likely 2.7 and 2.6 at
least for a start, and progressively to other versions.
The debug handler may deadlock with some threads waiting for isolation.
This may happen during a "show threads" command or even during a panic.
The reason is the call to thread_harmless_end() which waits for rdv_requests
to turn to zero before releasing its position in thread_dump_state,
while that one may not progress if another thread was interrupted in
thread_isolate() and is waiting for that thread to drop thread_dump_state.
In order to address this, we now use thread_harmless_end_sig() introduced
by previous commit:
MINOR: threads: add a thread_harmless_end() version that doesn't wait
However there's a catch: since commit f7afdd910 ("MINOR: debug: mark
oneself harmless while waiting for threads to finish"), there's a second
pair of thread_harmless_now()/thread_harmless_end() that surround the
loop around thread_dump_state. Marking a thread harmless before this
loop and dropping that without checking rdv_requests there could break
the harmless promise made to the other thread if it returns first and
proceeds with its isolated work. Hence we just drop this pair which was
only preventive for other signal handlers, while as indicated in that
patch's commit message, other signals are handled asynchronously and do
not require that extra protection.
This fix must be backported to 2.7.
The problem can be seen by running "show threads" in fast loops (100/s)
while reloading haproxy very quickly (10/s) and sending lots of traffic
to it (100krps, 15 Gbps). In this case the soft stop calls pool_gc()
which isolates a lot and manages to race with the dumps after a few
tens of seconds, leaving the process with all threads at 100%.
A few loops waiting for threads to synchronize such as thread_isolate()
rightfully filter the thread masks via the threads_enabled field that
contains the list of enabled threads. However, it doesn't use an atomic
load on it. Before 2.7, the equivalent variables were marked as volatile
and were always reloaded. In 2.7 they're fields in ha_tgroup_ctx[], and
the risk that the compiler keeps them in a register inside a loop is not
null at all. In practice when ha_thread_relax() calls sched_yield() or
an x86 PAUSE instruction, it could be verified that the variable is
always reloaded. If these are avoided (e.g. architecture providing
neither solution), it's visible in asm code that the variables are not
reloaded. In this case, if a thread exits just between the moment the
two values are read, the loop could spin forever.
This patch adds the required _HA_ATOMIC_LOAD() on the relevant
threads_enabled fields. It must be backported to 2.7.
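As an illustration of the change (a minimal before/after sketch; the exact
call sites differ):
  /* before: the compiler may keep the value in a register across iterations */
  ulong te = ha_tgroup_info[tgrp].threads_enabled;

  /* after: forces a reload from memory on every read */
  ulong te = _HA_ATOMIC_LOAD(&ha_tgroup_info[tgrp].threads_enabled);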
Commit c1640f79f ("BUG/MEDIUM: fd/threads: fix incorrect thread selection
in wakeup broadcast") fixed an incorrect range being used to pick a thread
when broadcasting a wakeup for a foreign thread, but the selection was still
wrong as the number of threads and their mask was taken from the current
thread instead of the target thread. In addition, the code dealing with the
wakeup of a thread from the same group was still relying on MAX_THREADS
instead of tg->count.
This could theoretically cause random crashes with more than one thread
group though this was never encountered.
This needs to be backported to 2.7.
Implement the conversion of H3 request trailers as HTX blocks. This is
done through a new function h3_trailers_to_htx(). If the request
contains forbidden trailers it is rejected with a stream error.
This should be backported up to 2.7.
First, the inspect-delay is now tested if the action is used on a
tcp-response content rule. Then, when an expression's scope is checked, we
now take care to detect the right scope depending on the ruleset used
(tcp-request, tcp-response, http-request or http-response).
This patch could be backported to 2.7.
It is now possible to set a constant for the limit or period parameters of a
set-bandwidth-limit action. The limit must follow the HAProxy size format
and is expressed in bytes. The period must follow the HAProxy time format
and is expressed in milliseconds. Of course, it is still possible to use
sample expressions instead.
The documentation was updated accordingly.
It is not really a bug. Only examples were written this way in the
documentation. But it could be good to backport this change in 2.7.
This patch implements the conversion of an HTX response containing
trailers into an H3 HEADERS frame. This is done through a new function
named h3_resp_trailers_send().
This was tested with a nginx configuration using <add_trailer>
statement.
It may be possible that the HTX buffer only contains an EOT block without a
preceding trailer. In this case, the conversion will produce nothing
but fin will be reported. This causes QUIC mux to generate an empty
STREAM frame with FIN bit set.
This should be backported up to 2.7.
Slightly adjust b_quic_enc_int(). This function is used to encode an
integer as a QUIC varint in a struct buffer.
A new parameter is added to the function API to specify the width of the
encoded integer. By default, 0 should be used to ensure that the minimum
space is used. Other valid values are 1, 2, 4 or 8. An error is reported
if the width is not large enough.
This new parameter will be useful when buffer space is reserved prior to
encoding an unknown integer value. The maximum size of 8 bytes will be
reserved and some data can be put after. When finally encoding the
integer, the width can be requested to be 8 bytes.
With this new parameter, a small refactoring of the function has been
conducted to remove some useless internal variables.
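A minimal sketch of the reserve-then-encode pattern described above (the
buffer-copy trick and variable names are assumptions; only the width
semantics come from this patch):
  /* remember where the length will go and reserve the maximum varint size */
  struct buffer len_area = *buf;
  b_add(buf, 8);

  /* ... append the frame payload into <buf>, accumulating <flen> ... */

  /* finally encode <flen> over the reserved area, forcing an 8-byte width
   * so that it exactly fills the reservation */
  b_quic_enc_int(&len_area, flen, 8);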
This should be backported up to 2.7. It will be mostly useful to
implement H3 trailers encoding.
Connection headers are not used in HTTP/3. As specified by RFC 9114, a
received message containing one of those is considered as malformed and
rejected. When converting an HTX message to HTTP/3, these headers are
silently skipped.
This must be backported up to 2.6. Note that assignment to <h3s.err>
must be removed on 2.6 as stream level error has been introduced in 2.7
so this field does not exist in 2.6. A connection error will be used
instead automatically.
Pierre Cheynier reported a very rare race condition on soft-stop in the
listeners. What happens is that if a previously limited listener is
being resumed by another thread finishing an accept loop, and at the
same time a soft-stop is performed, the soft-stop will turn the
listener's state to LI_INIT, and once the listener's lock is released,
resume_listener() in the second thread will try to resume this listener
which has an fd==-1, yielding a crash in listener_set_state():
FATAL: bug condition "l->rx.fd == -1" matched at src/listener.c:288
The reason is that resume_listener() only checks for LI_READY, but doesn't
consider being called with a non-initialized or a stopped listener. Let's
also make sure we don't try to resuscitate such a listener there.
This will have to be backported to all versions.
When the JWT token signature is using ECDSA algorithm (ES256 for
instance), the signature is a direct concatenation of the R and S
parameters instead of OpenSSL's DER format (see section
3.4 of RFC7518).
The code that verified the signatures wrongly assumed that they came in
OpenSSL's format and it did not actually work.
We now have the extra step of converting the signature into a complete
ECDSA_SIG that can be fed into OpenSSL's digest verification functions.
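For reference, a minimal sketch of that conversion with OpenSSL (error
handling omitted; <raw> holds the 2*n byte R||S signature):
  ECDSA_SIG *sig = ECDSA_SIG_new();
  BIGNUM *r = BN_bin2bn(raw, n, NULL);
  BIGNUM *s = BN_bin2bn(raw + n, n, NULL);
  unsigned char *der = NULL;
  int der_len;

  ECDSA_SIG_set0(sig, r, s);            /* <sig> takes ownership of <r> and <s> */
  der_len = i2d_ECDSA_SIG(sig, &der);   /* DER blob usable by EVP_DigestVerify() */
  /* ... verify, then OPENSSL_free(der) and ECDSA_SIG_free(sig) ... */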
The ECDSA signatures in the regtest had to be recalculated and it was
made via the PyJWT python library so that we don't end up checking
signatures that we built ourselves anymore.
This patch should fix GitHub issue #2001.
It should be backported up to branch 2.5.
Existing logic for checking whether a regex subexpression for pathinfo
is matched results in valid matches being ignored and non-matches having
a new zero length string stored in params->pathinfo. This patch reverses
the logic so params->pathinfo is set when the subexpression is matched.
Without this patch the example configuration in the documentation:
path-info ^(/.+\.php)(/.*)?$
does not result in PATH_INFO being sent to the FastCGI application, as
expected, when the second subexpression is matched (in which case both
pmatch[2].rm_so and pmatch[2].rm_eo will be non-negative integers).
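A minimal sketch of the reversed check (variable and field names are taken
from the message or assumed, not from the exact fcgi code):
  /* only set PATH_INFO when the second subexpression actually matched */
  if (pmatch[2].rm_so != -1 && pmatch[2].rm_eo != -1)
          params->pathinfo = ist2(path + pmatch[2].rm_so,
                                  pmatch[2].rm_eo - pmatch[2].rm_so);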
This patch may be backported as far as 2.2, the first release that made
the capture of this second subexpression optional.
This sample fetch returns a boolean. True if the support for QUIC transport
protocol was built and if this protocol was not disabled by "no-quic"
global option.
Must be backported to 2.7.
Add "no-quic" to "global" section to disable the use of QUIC transport protocol
by all configured QUIC listeners. These are listeners with QUIC addresses on
their "bind" lines. Internally, the socket address binding is skipped by
protocol_bind_all() for receivers with <proto_quic4> or <proto_quic6> as
protocol (see protocol struct).
Add information about "no-quic" global option to the documentation.
Must be backported to 2.7.
Set "disable_active_migration" transport parameter to inform the peer
haproxy listeners does not the connection migration feature.
Also drop all received datagrams with a modified source address.
Must be backported to 2.7.
In mux-h2 and mux-quic we still had two places manually setting
SE_FL_ERR_PENDING or SE_FL_ERROR depending on the EOS state, instead
of using se_fl_set_error() which takes care of the condition. Better
use the specialized function for this, it will allow to centralize
the conditions. Note that this will be needed to fix a bug.
While we do support quic4@ and quic6@ for listening addresses, it was
not possible to specify that we want to use an FD inherited from the
parent with QUIC. It's just a matter of making it possible to enable
a dgram-type socket and a stream-type transport, so let's add this.
Now it becomes possible to write "quic+fd@12", "quic+ipv4@addr" etc.
FDs inherited from a parent process do not deal well with suspend/resume
since commit 59b5da487 ("BUG/MEDIUM: listener: never suspend inherited
sockets") introduced in 2.3. The problem is that we now report that they
cannot be suspended at all, and they return a failure. As such, if a new
process fails to bind and sends SIGTTOU to the previous process, that
one will notice the failure and instantly switch to soft-stop, leaving
no chance to the new process to give up later and signal its failure.
What we need to do, however, is to stop receiving new connections from
such inherited FDs, which just means that the FD must be unsubscribed
from the poller (and resubscribed later if finally it has to stay).
With this a new process can start on the already bound FD without
problem thanks to the absence of polling, and when the old process
stops the new process will be alone on it.
This may be backported as far as 2.4.
Patrick Hemmer reported an interesting case where the status present in
the logs doesn't reflect what was reported to the user. During analysis
we could figure that it was in fact solely caused by the code dealing
with the set-status action. Indeed, set-status does update the status
in the HTX message itself but not in the HTTP transaction. However, at
most places where the status is needed to take a decision, it is
retrieved from the transaction, and the logs proceed like this as well,
though the "status" sample fetch function does retrieve it from the HTX
data. This particularly means that once a set-status has been used to
modify the status returned to the user, logs do not match that status,
and the response code distribution doesn't match either. However a
subsequent rule using the status as a condition will still match because
the "status" sample fetch function does also extract the status from the
HTX stream. Here's an example that fails:
frontend f
bind :8001
mode http
option httplog
log stdout daemon
http-after-response set-status 400
This will return a 400 to the client but log a 503 and increment
http_rsp_5xx.
In the end the root cause is that we need to make txn->status the only
authoritative place to get the status, and as such it must be updated
by the set-status rule. Ideally "status" should just use txn->status
but with the two synchronized this way it's not needed.
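A minimal sketch of the idea in the set-status action handler (names are
assumptions; the point is only to mirror the new code into the transaction):
  /* after rewriting the status line in the HTX message ... */
  s->txn->status = new_status;   /* keep the transaction status authoritative */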
This should be backported since it addresses some consistency issues
between logs and what's observed. The set-status action appeared in
1.9 so all stable versions are eligible.
When the payload length cannot be determined, the htx extra field is set to
the magical value ULLONG_MAX. It is not obvious. Thus a dedicated HTX value
is now used. Now, HTX_UNKOWN_PAYLOAD_LENGTH must be used in this case,
instead of ULLONG_MAX.
Since commit 473e0e54 ("BUG/MINOR: mux-h2: send a CANCEL instead of ES on
truncated writes"), a CANCEL may be reported when the response length is
unknown. It happens for H1 responses without a "Content-length" or
"Transfer-encoding" header.
Indeed, in this case, the end of the response is detected when the server
connection is closed. On the frontend side, the H2 multiplexer handles this
event as an abort and sends a RST_STREAM frame with the CANCEL error code.
The issue is not with the above commit but with the commit 4877045f1
("MINOR: mux-h2: make streams know if they need to send more data"). The
H2_SF_MORE_HTX_DATA flag must only be set if the payload length can be
determined.
This patch should fix the issue #1992. It must be backported to 2.7.
The error handling in the HTTP payload forwarding is far from ideal because
both sides (request and response) are tested each time. It is especially ugly
on the request side. To report a server error instead of a client error,
there are some workarounds to delay the error handling. The reason is that
the request analyzer is evaluated before the response one. In addition,
errors are tested before the data analysis. It means it is possible to
truncate data because errors may be handled too early.
So the error handling at this stage was totally reviewed. Aborts are now
handled after the data analysis. We also stop to finish the response on
request error or the opposite. As a side effect, the HTTP_MSG_ERROR state is
now useless. As another side effect, the termination flags are now set by
the HTTP analysers and not process_stream().
It was inherited from the legacy HTTP mode, but the message parsing is
handled by the underlying mux now. Thus, if a message is in HTTP_MSG_ERROR
state, it is just an analysis error and not a parsing error. So there is no
reason to block the HTTP sample fetch evaluation in this case.
This patch could be backported in all stable versions (For the 2.0, only the
htx part must be updated).
When HAProxy is waiting for the request body and an abort or an error is
detected, we can now use http_set_term_flags() function to set the termination
flags of the stream instead of handling it by hand.
When we wait for the request body, we are still in the request analysis. So
a SF_FINST_R flag must be reported in logs. Even if some data are already
received, at this staged, nothing is sent to the server.
This patch could be backported in all stable versions.
We use the new function to set the HTTP termination flags in the most
obvious places. The other places are a bit specific and will be handled one
by one in dedicated patches.
There is already a function to set termination flags but it is not well
suited for HTTP streams. So a function, dedicated to the HTTP analysis, was
added. This way, this new function will be called for HTTP analysers on
error. And if the error is not caught at this stage, the generic function
will still be called from process_stream().
Here, by default a PRXCOND error is reported and depending on the stream
state, the reason will be set accordingly:
* If the backend SC is in INI state, SF_FINST_T is reported on tarpit and
SF_FINST_R otherwise.
* SF_FINST_Q if the server connection is queued
* SF_FINST_C in any connection attempt state (REQ/TAR/ASS/CONN/CER/RDY).
Except for applets, a SF_FINST_R is reported.
* Once the server connection is established, SF_FINST_H is reported while
HTTP_MSG_DATA state on the response side.
* SF_FINST_L is reported if the response is in HTTP_MSG_DONE state or
higher and a client error/timeout was reported.
* Otherwise SF_FINST_D is reported.
Since 2.6 with commit 34e4085f8 ("MEDIUM: peers: Balance applets across
threads") the initialization of a peers appctx may be postponed with a
wakeup, causing some partially initialized appctx to be visible. The
"show peers" command used to only care about peers without appctx, but
now it must also take care of those with no stconn, otherwise it can
occasionally crash while dumping them.
This fix must be backported to 2.6. Thanks to Patrick Hemmer for
reporting the problem.
During multiple tests we've already noticed that shared stats counters
have become a real bottleneck under large thread counts. With QUIC it's
pretty visible, with qc_snd_buf() taking 2.5% of the CPU on a 48-thread
machine at only 25 Gbps, and this CPU is entirely spent in the atomic
increment of the byte count and byte rate. It's also visible in H1/H2
but slightly less since we're working with larger buffers, hence less
frequent updates. These counters are exclusively used to report the
byte count in "show info" and the byte rate in the stats.
Let's move them to the thread_ctx struct and make the stats reader
just collect each thread's stats when requested. That's way more
efficient than competing on a single cache line.
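A minimal sketch of the pattern (the counter field name is an assumption; only
the per-thread storage and on-demand summing reflect the commit):
  /* fast path: each thread only touches its own counter, no atomic RMW needed */
  th_ctx->out_bytes += ret;

  /* slow path ("show info" / stats): sum all threads' counters on demand */
  uint64_t total = 0;
  for (int i = 0; i < global.nbthread; i++)
          total += HA_ATOMIC_LOAD(&ha_thread_ctx[i].out_bytes);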
After this, qc_snd_buf has totally disappeared from the perf profile
and tests made in h1 show roughly 1% performance increase on small
objects.
When updating an OCSP response, in case of HTTP error (host unreachable
for instance) we do not want to reinsert the entry at the same place in
the update tree otherwise we might retry immediately the update of the
same response. This patch adds an arbitrary 1min time to the next_update
of a response in such a case.
After an HTTP error, instead of waking the update task up after an
arbitrary 10s time, we look for the first entry of the update tree and
sleep for the appropriate time.
In the unlikely event that the ocsp update task is started but the
update tree is empty, put the update task to sleep indefinitely.
The only way this can happen is if the same certificate is loaded under
two different names while the second one has the 'ocsp-update on'
option. Since the certificate names are distinct we will have two
ckch_stores but a single certificate_ocsp because they are identified by
the OCSP_CERTID which is built out of the issuer certificate and the
certificate id (which are the same regardless of the .pem file name).
If incompatibilities are found in a certificate's ocsp-update mode we
raised a single alert that will be considered fatal from here on. This
is changed because in case of incompatibilities we will end up with an
undefined behaviour. The ocsp response might or might not be updated
depending on the order in which the multiple ocsp-update options are
taken into account.
An arbitrary 5 minutes minimum interval between two updates of the same
OCSP response is defined but it was not properly used when inserting
entries in the update tree.
This patch does not need to be backported.
Since commit 36d9097cf ("MINOR: fd: Add BUG_ON checks on fd_insert()"),
there is currently a test in fd_insert() to detect that we're not trying
to reinsert an FD that had already been inserted. This test catches the
following anomalies:
frontend fail1
bind fd@0
bind fd@0
and:
frontend fail2
bind fd@0 shards 2
What happens is that clone_listener() is called on a listener already
having an FD, and when sock_{inet,unix}_bind_receiver() are called, the
same FD will be registered multiple times and rightfully crash in the
sanity check.
It wouldn't be correct to block shards though (e.g. they could be used
in a default-bind line). What looks like a safer and more future-proof
approach simply is to dup() the FD so that each listener has one copy.
This is also the only solution that might allow later to support more
than 64 threads on an inherited FD.
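A minimal sketch of the idea (listener variable names are assumptions; the
real change sits in the receiver binding/cloning path):
  /* each cloned listener gets a private copy of the inherited fd, so that
   * fd_insert() never sees the same fd registered twice */
  new_l->rx.fd = dup(ref_l->rx.fd);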
This needs to be backported as far as 2.4. Better wait for at least
one extra -dev version before backporting though, as the bug should
not be triggered often anyway.
The do_resolv action triggers a resolution and must wait for the
result. Concretely, if no cache entry is available, it creates a resolution
and wakes up the resolvers task. Then it yields. When the action is
recalled, if the resolution is still running, it yields again.
However, if the resolution is not running, the action does not check whether
it has actually been performed. Thus, it is possible to miss the resolution
result because the action was recalled before the resolvers task had a chance
to be executed. If there is no result yet, the action must yield.
This patch should fix the issue #1993. It must be backported as far as 2.0.
Both functions are buggy and don't respect the documentation. They
must wait for more data, if possible.
For Channel.data(), it must happen if not enough data was received orf if no
length was specified and no data was received. The first case is properly
handled but not the second one. An empty string is return instead. In
addition, if there is no data and the channel can't receive more data, 'nil'
value must be returned.
In the same spirit, for Channel.line(), we must try to wait for more data
when no line is found if not enough data was received or if no length was
specified. Here again, only the first case is properly handled. And for this
function too, 'nil' value must be returned if there is no data and the
channel can't receive more data.
This patch is related to the issue #1993. It must be backported as far as
2.5.
It is possible to have an "upgrade:" header and the corresponding value in
the "connection:" header for a non-101 response. It happens for
426-Upgrade-Required messages. However, on HAProxy side, a parsing error is
reported for this kind of message because no websocket key header
("sec-websocket-accept:") is found in the response.
So a possible fix could be to not perform this test for non-101
responses. However, having flags about protocol upgrade on this kind of
response could lead to other bugs. Instead, corresponding flags are
removed. Thus, during the H1 response post-parsing, H1_MF_CONN_UPG and
H1_MF_UPG_WEBSOCKET flags are removed from any non-101 response.
This patch should fix the issue #1997. It must be backported as far as 2.4.
Sending is done with several iterations over qcs streams in qc_send().
The first loop is conducted over streams in <qcc.send_list>. After this
first iteration, some streams may still have data in their Tx buffer but
were blocked by a full qc_stream_desc buffer. In this case, they have
released their qc_stream_desc buffer in qcc_streams_sent_done().
New iterations can be done for these streams which can allocate new
qc_stream_desc buffer if available. Before this patch, this was done
through another stream list <qcc.send_retry_list>. Now, we can reuse the
new <qcc.send_list> for this usage.
This is safe to use as after the first iteration, we have the guarantee that
one of the following is true if there are still streams in
<qcc.send_list>:
* transport layer has rejected data due to congestion
* stream is left because it is blocked on stream flow control
* stream still has data and has released a full qc_stream_desc
buffer. Immediate retry is useful for these streams : they will
allocate a new qc_stream_desc buffer if possible to continue sending.
This must be backported up to 2.7.
When a STOP_SENDING or RESET_STREAM must be send, its corresponding qcs
is inserted into <qcc.send_list> via qcc_reset_stream() or
qcc_abort_stream_read().
This allows to remove the iteration on full qcs tree in qc_send().
Instead, STOP_SENDING and RESET_STREAM is done in the loop over
<qcc.send_list> as with STREAM frames. This should slightly improve the
performance, most notably when a large number of streams is opened.
This must be backported up to 2.7.
Complete qcc_send_stream() function to allow to specify if the stream
should be handled in priority. Internally this will insert the qcs
instance in front of <qcc.send_list> to be able to treat it before other
streams.
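A minimal sketch of the described behavior (the prototype and list member
name are assumptions):
  void qcc_send_stream(struct qcs *qcs, int urg)
  {
          struct qcc *qcc = qcs->qcc;

          /* a stream already queued would first be detached in the real code */
          if (urg)
                  LIST_INSERT(&qcc->send_list, &qcs->el_send);  /* head: sent first */
          else
                  LIST_APPEND(&qcc->send_list, &qcs->el_send);  /* tail: fair order */
  }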
This functionality is useful when some QUIC streams should be sent
before others. Most notably, this is used to guarantee that H3 SETTINGS
is done first via the control stream.
This must be backported up to 2.7.
Implement a mechanism to register streams ready to send data in new
STREAM frames. Internally, this is implemented with a new list
<qcc.send_list> which contains qcs instances.
A qcs can be registered safely using the new function qcc_send_stream().
This is done automatically in qc_send_buf() which covers most cases.
Also, application layer is free to use it for internal usage streams.
This is currently the case for H3 control stream with SETTINGS sending.
The main point of this patch is to handle stream sending fairly. This is
in stark contrast with previous code where streams with lower ID were
always prioritized. This could cause other streams to be indefinitely
blocked behind a stream which has a lot of data to transfer. Now,
streams are handled in an order scheduled by se_desc layer.
This commit is the first one of a series which will bring other
improvements which also rely on the send_list implementation.
This must be backported up to 2.7 when deemed sufficiently stable.
Add new traces when QUIC flow-control limits are reached at stream or
connection level. This may help to explain an interrupted transfer.
This should be backported up to 2.6.
A QUIC stream did not transfer its response if it was an empty HTTP
response without headers nor entity body. This is caused by an
incomplete condition on qc_send() which skips streams with empty
<tx.buf>.
Fix this by extending the condition. Sending will be conducted on a
stream if <tx.buf> is not empty or FIN notification must be provided.
This allows to send the last STREAM frame for this stream.
Such HTTP responses should be extremely rare so this bug is labelled as
MINOR. It was encountered with an HTTP/0.9 request on an empty payload.
The bug was triggered as HTTP/0.9 does not support headers in the response
message.
Also, note that the condition to wake up the MUX tasklet has been changed
similarly in qc_send_buf(). It is not mandatory for proper operation
however, most probably because another tasklet_wakeup() is done
before/after.
This should be backported up to 2.6.
In applets, we stop processing when a write error (CF_WRITE_ERROR) or a shutdown
for writes (CF_SHUTW) is detected. However, any write error leads to an
immediate shutdown for writes. Thus, it is enough to only test if CF_SHUTW is
set.
When a read error (CF_READ_ERROR) is reported, a shutdown for reads is
always performed (CF_SHUTR). Thus, there is no reason to check if
CF_READ_ERROR is set if CF_SHUTR is also checked.
CF_READ_ATTACHED flag is only used in input events for stream analyzers,
CF_MASK_ANALYSER. A read event can be reported instead and this flag can be
removed. We must only take care to report a read event when the client
connection is upgraded from TCP to HTTP.
It appears CF_ANA_TIMEOUT is a flag only used in CF_MASK_ANALYSER. All
analyzer timeouts rely on the analysis expiration date (chn->analyse_exp).
Worse, once set, this flag is never removed. Thus this flag can be removed
and replaced by a read event (CF_READ_EVENT).
Thanks to previous changes, CF_WRITE_ACTIVITY flags can be removed.
Everywhere it was used, its value is now directly used
(CF_WRITE_EVENT|CF_WRITE_ERROR).
Thanks to previous changes, CF_READ_ACTIVITY flags can be removed.
Everywhere it was used, its value is now directly used
(CF_READ_EVENT|CF_READ_ERROR).
Just like CF_READ_PARTIAL, CF_WRITE_PARTIAL is now merged with
CF_WRITE_EVENT. There is a subtlety in sc_notify(). The "connect" event
(formerly CF_WRITE_NULL) is now detected with
(CF_WRITE_EVENT + sc->state < SC_ST_EST).
CF_READ_PARTIAL flag is now merged with CF_READ_EVENT. It means
CF_READ_EVENT is set when a read0 is received (formely CF_READ_NULL) or when
data are received (formely CF_READ_ACTIVITY).
There is nothing special here, except conditions to wake the stream up in
sc_notify(). Indeed, the test was a bit changed to reflect recent
change. read0 event is now formalized by (CF_READ_EVENT + CF_SHUTR).
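The two composite events mentioned above can be illustrated with a
standalone snippet (names mirror the message, flag values and the sc
state enum are simplified placeholders):

  #include <stdbool.h>
  #include <stdio.h>

  #define CF_READ_EVENT  0x01
  #define CF_WRITE_EVENT 0x02
  #define CF_SHUTR       0x04

  enum sc_state { SC_ST_CON = 0, SC_ST_EST = 1 };

  /* former CF_WRITE_NULL "connect" event */
  static bool connect_event(unsigned int flags, enum sc_state st)
  {
      return (flags & CF_WRITE_EVENT) && st < SC_ST_EST;
  }

  /* former CF_READ_NULL "read0" event */
  static bool read0_event(unsigned int flags)
  {
      return (flags & CF_READ_EVENT) && (flags & CF_SHUTR);
  }

  int main(void)
  {
      printf("connect: %d, read0: %d\n",
             connect_event(CF_WRITE_EVENT, SC_ST_CON),
             read0_event(CF_READ_EVENT | CF_SHUTR));
      return 0;
  }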
As for CF_READ_NULL, it appears CF_WRITE_NULL and other write events on a
channel are mainly used to wake up the stream and may be replaced by one
write event.
In this patch, we introduce the CF_WRITE_EVENT flag as a replacement for
CF_WRITE_NULL. There is no breaking change for now, it is just a rename.
Gradually, other write events will be merged with this one.
The CF_READ_NULL flag is not really useful nor really used. It is a
transient event used to wake up the stream. As we will see, all read
events on a channel may be reduced to only one, and they are all used to
wake up the stream.
In this patch, we introduce the CF_READ_EVENT flag as a replacement for
CF_READ_NULL. There is no breaking change for now, it is just a rename.
Gradually, other read events will be merged with this one.
If CF_READ_NULL flag is set on a channel, it implies a shutdown for reads
was performed and CF_SHUTR is also set on this channel. Thus, there is no
reason to test if any of these flags is present; testing CF_SHUTR is enough.
When calling 'update ssl ocsp-response' with an unknown certificate file
name, the error message would mention a "ckch_store", which is an
internal structure unknown to users.
The ocsp_uri field of the certificate_ocsp structure was a 16k buffer
when it could have been allocated to just the size required to store the
OCSP URI. This field now behaves the same way as the sctl and
ocsp_response buffers of the ckch_store structure.
If a given certificate is used multiple times in a configuration, the
ocsp_cid field would have been overwritten during each
ssl_sock_load_ocsp call even if it was previously filled.
This patch does not need to be backported.
If a configuration such as the following was included in a crt-list
file, it would not have raised a warning about 'ocsp-update'
inconsistencies for the concerned certificate:
cert.pem [ocsp-update on]
cert.pem
because the second line has a NULL entry->ssl_conf.
In the unlikely event that the OCSP update task is killed in the middle
of an update process (request sent but no response received yet), the
cur_ocsp member of the update context would keep an unneeded reference
to a certificate_ocsp object. It must then be freed during the task's
cleanup.
If the ocsp issuer certificate was actually taken from the certificate
chain in ssl_sock_load_ocsp, we don't need to keep an extra reference on
it since we already keep a reference to the full certificate chain.
When calling OCSP_basic_verify to check the validity of the received
OCSP response, we need to provide an untrusted certificate chain as well
as an X509_STORE holding only trusted certificates. Since the
certificate chain and the issuer certificate are all provided by the
user, we assume that they are valid and we add them all to a temporary
store. This allows the check to focus only on the response's validity.
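The scheme can be sketched with the public OpenSSL calls involved
(illustration only, not the actual HAProxy function; error handling and
the surrounding context are omitted):

  #include <openssl/ocsp.h>
  #include <openssl/x509.h>

  /* Trust everything the user configured (issuer + chain) in a throwaway
   * store so that OCSP_basic_verify() only judges the response itself. */
  static int check_ocsp_response(OCSP_BASICRESP *bs, X509 *issuer,
                                 STACK_OF(X509) *chain)
  {
      X509_STORE *store = X509_STORE_new();
      int i, ret;

      if (!store)
          return -1;

      if (issuer)
          X509_STORE_add_cert(store, issuer);
      for (i = 0; i < sk_X509_num(chain); i++)
          X509_STORE_add_cert(store, sk_X509_value(chain, i));

      /* the chain is also passed as untrusted material used to build the
       * path; with the store above, only the response is really checked */
      ret = OCSP_basic_verify(bs, chain, store, 0);

      X509_STORE_free(store);
      return ret == 1 ? 0 : -1;
  }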
When ocsp-update is enabled for a given certificate, its
certificate_ocsp object is inserted in two separate trees (the actual
ocsp response one and the ocsp update one). But since the same instance
is used for the two trees, its ownership is kept by the regular ocsp
response one. The ocsp update task should then never have to free the
ocsp entries. The crash actually occurred because of this. The update
task was freeing entries whose reference counter was not increased while
a reference was still held by the SSL_CTXs.
The only time during which the ocsp update task will need to increase
the reference counter is during an actual update, because at this moment
the entry is taken out of the update tree and a 'flying' reference to
the certificate_ocsp is kept in the ocsp update context.
This bug could be reproduced by calling './haproxy -f conf.cfg -c' with
any of the used certificates having the 'ocsp-update on' option. For
some reason ASAN caught the bug easily but Valgrind did not.
This patch does not need to be backported.
This CLI command crashed when called for a certificate which did not
have an OCSP response during startup because it assumed that the
ocsp_issuer pointer of the ckch_data object would be valid. It was only
true for already known OCSP responses though.
The ocsp issuer certificate is now taken either from the ocsp_issuer
pointer or looked for in the certificate chain. This is the same logic
as the one in ssl_sock_load_ocsp.
This patch does not need to be backported.
Released version 2.8-dev1 with the following main changes :
- MEDIUM: 51d: add support for 51Degrees V4 with Hash algorithm
- MINOR: debug: support pool filtering on "debug dev memstats"
- MINOR: debug: add a balance of alloc - free at the end of the memstats dump
- LICENSE: wurfl: clarify the dummy library license.
- MINOR: event_hdl: add event handler base api
- DOC/MINOR: api: add documentation for event_hdl feature
- MEDIUM: ssl: rename the struct "cert_key_and_chain" to "ckch_data"
- MINOR: quic: remove qc from quic_rx_packet
- MINOR: quic: complete traces in qc_rx_pkt_handle()
- MINOR: quic: extract datagram parsing code
- MINOR: tools: add port for ipcmp as optional criteria
- MINOR: quic: detect connection migration
- MINOR: quic: ignore address migration during handshake
- MINOR: quic: startup detect for quic-conn owned socket support
- MINOR: quic: test IP_PKTINFO support for quic-conn owned socket
- MINOR: quic: define config option for socket per conn
- MINOR: quic: allocate a socket per quic-conn
- MINOR: quic: use connection socket for emission
- MEDIUM: quic: use quic-conn socket for reception
- MEDIUM: quic: move receive out of FD handler to quic-conn io-cb
- MINOR: mux-quic: rename duplicate function names
- MEDIUM: quic: requeue datagrams received on wrong socket
- MINOR: quic: reconnect quic-conn socket on address migration
- MINOR: quic: activate socket per conn by default
- BUG/MINOR: ssl: initialize SSL error before parsing
- BUG/MINOR: ssl: initialize WolfSSL before parsing
- BUG/MINOR: quic: fix fd leak on startup check quic-conn owned socket
- BUG/MEDIUM: stconn: Flush output data before forwarding close to write side
- MINOR: server: add srv->rid (revision id) value
- MINOR: stats: add server revision id support
- MINOR: server/event_hdl: add support for SERVER_ADD and SERVER_DEL events
- MINOR: server/event_hdl: add support for SERVER_UP and SERVER_DOWN events
- BUG/MEDIUM: checks: do not reschedule a possibly running task on state change
- BUG/MINOR: checks: make sure fastinter is used even on forced transitions
- CLEANUP: assorted typo fixes in the code and comments
- MINOR: mworker: display an alert upon a wait-mode exit
- BUG/MEDIUM: mworker: fix segv in early failure of mworker mode with peers
- BUG/MEDIUM: mworker: create the mcli_reload socketpairs in case of upgrade
- BUG/MINOR: checks: restore legacy on-error fastinter behavior
- MINOR: check: use atomic for s->consecutive_errors
- MINOR: stats: properly handle ST_F_CHECK_DURATION metric
- MINOR: mworker: remove unused legacy code in mworker_cleanlisteners
- MINOR: peers: unused code path in process_peer_sync
- BUG/MINOR: init/threads: continue to limit default thread count to max per group
- CLEANUP: init: remove useless assignment of nbthread
- BUILD: atomic: atomic.h may need compiler.h on ARMv8.2-a
- BUILD: makefile/da: also clean Os/ in Device Atlas dummy lib dir
- BUG/MEDIUM: httpclient/lua: double LIST_DELETE on end of lua task
- CLEANUP: pools: move the write before free to the uaf-only function
- CLEANUP: pool: only include pool-os from pool.c not pool.h
- REORG: pool: move all the OS specific code to pool-os.h
- CLEANUP: pools: get rid of CONFIG_HAP_POOLS
- DEBUG: pool: show a few examples in -dMhelp
- MINOR: pools: make DEBUG_UAF a runtime setting
- BUG/MINOR: promex: create haproxy_backend_agg_server_status
- MINOR: promex: introduce haproxy_backend_agg_check_status
- DOC: promex: Add missing backend metrics
- BUG/MAJOR: fcgi: Fix uninitialized reserved bytes
- REGTESTS: fix the race conditions in iff.vtc
- CI: github: reintroduce openssl 1.1.1
- BUG/MINOR: quic: properly handle alloc failure in qc_new_conn()
- BUG/MINOR: quic: handle alloc failure on qc_new_conn() for owned socket
- CLEANUP: mux-quic: remove unused attribute on qcs_is_close_remote()
- BUG/MINOR: mux-quic: remove qcs from opening-list on free
- BUG/MINOR: mux-quic: handle properly alloc error in qcs_new()
- CI: github: split ssl lib selection based on git branch
- REGTESTS: startup: check maxconn computation
- BUG/MINOR: startup: don't use internal proxies to compute the maxconn
- REGTESTS: startup: change the expected maxconn to 11000
- CI: github: set ulimit -n to a greater value
- REGTESTS: startup: activate automatic_maxconn.vtc
- MINOR: sample: add param converter
- CLEANUP: ssl: remove check on srv->proxy
- BUG/MEDIUM: freq-ctr: Don't compute overshoot value for empty counters
- BUG/MEDIUM: resolvers: Use tick_first() to update the resolvers task timeout
- REGTESTS: startup: add alternatives values in automatic_maxconn.vtc
- BUG/MEDIUM: h3: reject request with invalid header name
- BUG/MEDIUM: h3: reject request with invalid pseudo header
- MINOR: http: extract content-length parsing from H2
- BUG/MEDIUM: h3: parse content-length and reject invalid messages
- CI: github: remove redundant ASAN loop
- CI: github: split matrix for development and stable branches
- BUG/MEDIUM: mux-h1: Don't release H1 stream upgraded from TCP on error
- BUG/MINOR: mux-h1: Fix test instead a BUG_ON() in h1_send_error()
- MINOR: http-htx: add BUG_ON to prevent API error on http_cookie_register
- BUG/MEDIUM: h3: fix cookie header parsing
- BUG/MINOR: h3: fix memleak on HEADERS parsing failure
- MINOR: h3: check return values of htx_add_* on headers parsing
- MINOR: ssl: Remove unneeded buffer allocation in show ocsp-response
- MINOR: ssl: Remove unnecessary alloc'ed trash chunk in show ocsp-response
- BUG/MINOR: ssl: Fix memory leak of find_chain in ssl_sock_load_cert_chain
- MINOR: stats: provide ctx for dumping functions
- MINOR: stats: introduce stats field ctx
- BUG/MINOR: stats: fix show stat json buffer limitation
- MINOR: stats: make show info json future-proof
- BUG/MINOR: quic: fix crash on PTO rearm if anti-amplification reset
- BUILD: 51d: fix build issue with recent compilers
- REGTESTS: startup: disable automatic_maxconn.vtc
- BUILD: peers: peers-t.h depends on stick-table-t.h
- BUG/MEDIUM: tests: use tmpdir to create UNIX socket
- BUG/MINOR: mux-h1: Report EOS on parsing/internal error for not running stream
- BUG/MINOR: mux-h1: Never handle error at mux level for running connection
- BUG/MEDIUM: stats: Rely on a local trash buffer to dump the stats
- OPTIM: pool: split the read_mostly from read_write parts in pool_head
- MINOR: pool: make the thread-local hot cache size configurable
- MINOR: freq_ctr: add opportunistic versions of swrate_add()
- MINOR: pool: only use opportunistic versions of the swrate_add() functions
- REGTESTS: ssl: enable the ssl_reuse.vtc test for WolfSSL
- BUG/MEDIUM: mux-quic: fix double delete from qcc.opening_list
- BUG/MEDIUM: quic: properly take shards into account on bind lines
- BUG/MINOR: quic: do not allocate more rxbufs than necessary
- MINOR: ssl: Add a lock to the OCSP response tree
- MINOR: httpclient: Make the CLI flags public for future use
- MINOR: ssl: Add helper function that extracts an OCSP URI from a certificate
- MINOR: ssl: Add OCSP request helper function
- MINOR: ssl: Add helper function that checks the validity of an OCSP response
- MINOR: ssl: Add "update ssl ocsp-response" cli command
- MEDIUM: ssl: Add ocsp_certid in ckch structure and discard ocsp buffer early
- MINOR: ssl: Add ocsp_update_tree and helper functions
- MINOR: ssl: Add crt-list ocsp-update option
- MINOR: ssl: Store 'ocsp-update' mode in the ckch_data and check for inconsistencies
- MEDIUM: ssl: Insert ocsp responses in update tree when needed
- MEDIUM: ssl: Add ocsp update task main function
- MEDIUM: ssl: Start update task if at least one ocsp-update option is set to on
- DOC: ssl: Add documentation for ocsp-update option
- REGTESTS: ssl: Add tests for ocsp auto update mechanism
- MINOR: ssl: Move OCSP code to a dedicated source file
- BUG/MINOR: ssl/ocsp: check chunk_strcpy() in ssl_ocsp_get_uri_from_cert()
- CLEANUP: ssl/ocsp: add spaces around operators
- BUG/MEDIUM: mux-h2: Refuse interim responses with end-stream flag set
- BUG/MINOR: pool/stats: Use ullong to report total pool usage in bytes in stats
- BUG/MINOR: ssl/ocsp: httpclient blocked when doing a GET
- MINOR: httpclient: don't add body when istlen is empty
- MEDIUM: httpclient: change the default log format to skip duplicate proxy data
- BUG/MINOR: httpclient/log: free of invalid ptr with httpclient_log_format
- MEDIUM: mux-quic: implement shutw
- MINOR: mux-quic: do not count stream flow-control if already closed
- MINOR: mux-quic: handle RESET_STREAM reception
- MEDIUM: mux-quic: implement STOP_SENDING emission
- MINOR: h3: use stream error when needed instead of connection
- CI: github: enable github api authentication for OpenSSL tags read
- BUG/MINOR: mux-quic: ignore remote unidirectional stream close
- CI: github: use the GITHUB_TOKEN instead of a manually generated token
- BUILD: makefile: build the features list dynamically
- BUILD: makefile: move common options-oriented macros to include/make/options.mk
- BUILD: makefile: sort the features list
- BUILD: makefile: initialize all build options' variables at once
- BUILD: makefile: add a function to collect all options' CFLAGS/LDFLAGS
- BUILD: makefile: start to automatically collect CFLAGS/LDFLAGS
- BUILD: makefile: ensure that all USE_* handlers appear before CFLAGS are used
- BUILD: makefile: clean the wolfssl include and lib generation rules
- BUILD: makefile: make sure to also ignore SSL_INC when using wolfssl
- BUILD: makefile: reference libdl only once
- BUILD: makefile: make sure LUA_INC and LUA_LIB are always initialized
- BUILD: makefile: do not restrict Lua's prepend path to empty LUA_LIB_NAME
- BUILD: makefile: never force -latomic, set USE_LIBATOMIC instead
- BUILD: makefile: add an implicit USE_MATH variable for -lm
- BUILD: makefile: properly report USE_PCRE/USE_PCRE2 in features
- CLEANUP: makefile: properly indent ifeq/ifneq conditional blocks
- BUILD: makefile: rework 51D to split v3/v4
- BUILD: makefile: support LIBCRYPT_LDFLAGS
- BUILD: makefile: support RT_LDFLAGS
- BUILD: makefile: support THREAD_LDFLAGS
- BUILD: makefile: support BACKTRACE_LDFLAGS
- BUILD: makefile: support SYSTEMD_LDFLAGS
- BUILD: makefile: support ZLIB_CFLAGS and ZLIB_LDFLAGS
- BUILD: makefile: support ENGINE_CFLAGS
- BUILD: makefile: support OPENSSL_CFLAGS and OPENSSL_LDFLAGS
- BUILD: makefile: support WOLFSSL_CFLAGS and WOLFSSL_LDFLAGS
- BUILD: makefile: support LUA_CFLAGS and LUA_LDFLAGS
- BUILD: makefile: support DEVICEATLAS_CFLAGS and DEVICEATLAS_LDFLAGS
- BUILD: makefile: support PCRE[2]_CFLAGS and PCRE[2]_LDFLAGS
- BUILD: makefile: refactor support for 51DEGREES v3/v4
- BUILD: makefile: support WURFL_CFLAGS and WURFL_LDFLAGS
- BUILD: makefile: make all OpenSSL variants use the same settings
- BUILD: makefile: remove the special case of the SSL option
- BUILD: makefile: only consider settings from enabled options
- BUILD: makefile: also list per-option settings in 'make opts'
- BUG/MINOR: debug: don't mask the TH_FL_STUCK flag before dumping threads
- MINOR: cfgparse-ssl: avoid a possible crash on OOM in ssl_bind_parse_npn()
- BUG/MINOR: ssl: Missing goto in error path in ocsp update code
- BUG/MINOR: stick-table: report the correct action name in error message
- CI: Improve headline in matrix.py
- CI: Add in-memory cache for the latest OpenSSL/LibreSSL
- CI: Use proper `if` blocks instead of conditional expressions in matrix.py
- CI: Unify the `GITHUB_TOKEN` name across matrix.py and vtest.yml
- CI: Explicitly check environment variable against `None` in matrix.py
- CI: Reformat `matrix.py` using `black`
- MINOR: config: add environment variables for default log format
- REGTESTS: Remove REQUIRE_VERSION=1.9 from all tests
- REGTESTS: Remove REQUIRE_VERSION=2.0 from all tests
- REGTESTS: Remove tests with REQUIRE_VERSION_BELOW=1.9
- BUG/MINOR: http-fetch: Only fill txn status during prefetch if not already set
- BUG/MAJOR: buf: Fix copy of wrapping output data when a buffer is realigned
- DOC: config: fix alphabetical ordering of http-after-response rules
- MINOR: http-rules: Add missing actions in http-after-response ruleset
- DOC: config: remove duplicated "http-response sc-set-gpt0" directive
- BUG/MINOR: proxy: free orgto_hdr_name in free_proxy()
- REGTEST: fix the race conditions in json_query.vtc
- REGTEST: fix the race conditions in add_item.vtc
- REGTEST: fix the race conditions in digest.vtc
- REGTEST: fix the race conditions in hmac.vtc
- BUG/MINOR: fd: avoid bad tgid assertion in fd_delete() from deinit()
- BUG/MINOR: http: Memory leak of http redirect rules' format string
- MEDIUM: stick-table: set the track-sc limit at boottime via tune.stick-counters
- MINOR: stick-table: implement the sc-add-gpc() action
This action increments the General Purpose Counter at the index <idx> of
the array associated to the sticky counter designated by <sc-id> by the
value of either integer <int> or the integer evaluation of expression
<expr>. Integers and expressions are limited to unsigned 32-bit values.
If an error occurs, this action silently fails and the actions evaluation
continues. <idx> is an integer between 0 and 99 and <sc-id> is an integer
between 0 and 2. It also silently fails if there is no GPC stored at
this index. The entry in the table is refreshed even if the value is zero.
The 'gpc_rate' is automatically adjusted to reflect the average growth
rate of the gpc value.
The main use of this action is to count scores or total volumes (e.g.
estimated danger per source IP reported by the server or a WAF, total
uploaded bytes, etc).
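An illustrative configuration fragment (table name, sizes and the added
value are arbitrary) could look like this:

  backend st
      stick-table type ip size 100k expire 30m store gpc(2),gpc_rate(2,1m)

  frontend fe
      bind :8080
      http-request track-sc0 src table st
      # add 10 to the GPC stored at index 1 of the array tracked by sc0
      http-request sc-add-gpc(1,0) 10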
The number of stick-counter entries usable by track-sc rules is currently
set at build time. There is no good value for this since the vast majority
of users don't need any, most need only a few and rare users need more.
Adding more counters for everyone increases memory and CPU usage for no
reason.
This patch moves the per-session and per-stream arrays to a pool of a size
defined at boot time. This way it becomes possible to set the number of
entries at boot time via a new global setting "tune.stick-counters" that
sets the limit for the whole process. When this tunable is not set, the
MAX_SESS_STR_CTR value still applies (3 if the macro is not defined), as
before.
It is also possible to lower the value to 0 to save a bit of memory if
not used at all.
Note that a few low-level sample-fetch functions had to be protected due
to the ability to use sample-fetches in the global section to set some
variables.
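For example (the value here is arbitrary), the limit can be raised or
lowered in the global section:

  global
      # raise the per-stream stick-counter limit from the build-time
      # default (MAX_SESS_STR_CTR, 3 if unset) to 10 entries
      tune.stick-counters 10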
When the configuration contains such a line:
http-request redirect location /
a "struct logformat_node" object is created and it contains an "arg"
member, which is also allocated, into which we copy the new location
(see add_to_logformat_list). This internal arg pointer was not freed in
the dedicated release_http_redir release function.
Likewise, the expression pointer was not released either.
This patch can be backported to all stable branches. It should apply
as-is all the way to 2.2 but it won't on 2.0 because release_http_redir
did not exist yet.
In 2.7, commit 0dc1cc93b ("MAJOR: fd: grab the tgid before manipulating
running") added a check to make sure we never try to delete an FD from
the wrong thread group. It already handles the specific case of an
isolated thread (e.g. stop a listener from the CLI) but forgot to take
into account the deinit() code iterating over all idle server connections
to close them. This results in the crash below during deinit() if thread
groups are enabled and idle connections exist on a thread group higher
than 1.
[WARNING] (15711) : Proxy decrypt stopped (cumulated conns: FE: 64, BE: 374511).
[WARNING] (15711) : Proxy stats stopped (cumulated conns: FE: 0, BE: 0).
[WARNING] (15711) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
FATAL: bug condition "fd_tgid(fd) != ti->tgid && !thread_isolated()" matched at src/fd.c:369
call trace(11):
| 0x4a6060 [c6 04 25 01 00 00 00 00]: main-0x1d60
| 0x67fcc6 [c7 43 68 fd ad de fd 5b]: sock_conn_ctrl_close+0x16/0x1f
| 0x59e6f5 [48 89 ef e8 83 65 11 00]: main+0xf6935
| 0x60ad16 [48 8b 1b 48 81 fb a0 91]: free_proxy+0x716/0xb35
| 0x62750e [48 85 db 74 35 48 89 dd]: deinit+0xbe/0x87a
| 0x627ce2 [89 ef e8 97 76 e7 ff 0f]: deinit_and_exit+0x12/0x19
| 0x4a9694 [bf e6 ff 9d 00 44 89 6c]: main+0x18d4/0x2c1a
There's no harm though since all traffic already ended. This must be
backported to 2.7.
Unlike fwdfor_hdr_name, orgto_hdr_name was not properly freed in
free_proxy().
This did not cause observable memory leaks because the originalto proxy
option is only used for user-configurable proxies, which are only freed
right before process termination.
No backport is needed unless architectural changes are made that cause
regular proxies to be freed and reused multiple times within a single
process lifetime.
This patch adds the support of following actions in the http-after-response
ruleset:
* set-map, del-map and del-acl
* set-log-level
* sc-inc-gpc, sc-inc-gpc0 and sc-inc-gpc1
* sc-set-gpt and sc-set-gpt0
This patch should solve the issue #1980.
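As an illustration (status codes, table sizes and counter index chosen
arbitrarily), such rules can now be written:

  listen l1
      mode http
      bind :8000
      stick-table type ip size 10k expire 10m store gpc0
      http-request track-sc0 src
      # newly supported in the http-after-response ruleset
      http-after-response set-log-level silent if { status 404 }
      http-after-response sc-inc-gpc0(0) if { status 500 }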
When an HTTP sample fetch is evaluated, a prefetch is performed to check
that the channel contains a valid HTTP message. If the HTTP analysis has
not already started, some info is filled in.
It may be an issue when an error is returned before the response analysis
and when http-after-response rules are used, because the original HTTP
txn status may be overwritten. For instance, with the following
configuration:
listen l1
log global
mode http
bind :8000
log-format ST=%ST
http-after-response set-status 400
#http-after-response set-var(res.foo) status
A "ST=503" is reported in the log messages, independently of the first
http-after-response rule. The same must happen if the second rule is
uncommented. However, for now, a "ST=400" is logged.
To fix the bug, during the prefetch, the HTTP txn status is only set if it
is undefined (-1). This way, we are sure the original one is never lost.
This patch should be backported as far as 2.2.
This patch provides a convenient way to override the default TCP, HTTP
and HTTPS log formats. Instead of having to look into the documentation
to figure out the appropriate default log format, three new environment
variables can be used: HAPROXY_TCP_LOG_FMT, HAPROXY_HTTP_LOG_FMT and
HAPROXY_HTTPS_LOG_FMT. Their content is substituted verbatim.
These variables are set before parsing the configuration and are unset
just after all configuration files are successfully parsed.
Example:
# Instead of writing this long log-format line...
log-format "%ci:%cp [%tr] %ft %b/%s %TR/%Tw/%Tc/%Tr/%Ta %ST %B %CC \
%CS %tsc %ac/%fc/%bc/%sc/%rc %sq/%bq %hr %hs %{+Q}r \
lr=last_rule_file:last_rule_line"
# ..the HAPROXY_HTTP_LOG_FMT can be used to provide the default
# http log-format string
log-format "${HAPROXY_HTTP_LOG_FMT} lr=last_rule_file:last_rule_line"
Please note that nothing prevents users from unsetting the variables or
overriding their content in a global section.
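For example (the log format string below is shortened arbitrarily), the
value can be overridden with setenv in the global section:

  global
      setenv HAPROXY_HTTP_LOG_FMT "%ci:%cp [%tr] %ft %b/%s %ST %B %{+Q}r"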
Signed-off-by: Sébastien Gross <sgross@haproxy.com>
sc-inc-gpc() learned to use arrays in 2.5 with commit 4d7ada8f9 ("MEDIUM:
stick-table: add the new arrays of gpc and gpc_rate"), but the error
message says "sc-set-gpc" instead of "sc-inc-gpc". Let's fix this to
avoid confusion.
This can be backported to 2.5.
When converting an OCSP request's information into base64, the return
value of a2base64 is checked but processing is not interrupted when it
returns a negative value, which was caught by Coverity.
This patch fixes GitHub issue #1974.
It does not need to be backported.
Upon an out-of-memory condition at boot, we could possibly crash when
parsing the "npn" bind line keyword since it's used unchecked. There's
no real need to backport this though it will not hurt.
Commit f0c86ddfe ("BUG/MEDIUM: debug: fix parallel thread dumps again")
added a clearing of the TH_FL_STUCK flag before dumping threads in case
of parallel dumps, but that was in part a sort of workaround for some
remains of the commit that introduced the flag in 2.0 before the watchdog
existed, and which would set it after dumping a thread: e6a02fa65 ("MINOR:
threads: add a "stuck" flag to the thread_info struct"), and in part an
attempt to avoid that a thread waiting for too long during the dump would
get the flag set. But that is not possible: a thread waiting to be
dumped has the harmless bit set and does not get the stuck bit. What
happens in fact is that issuing "show threads" in fast loops ends up
causing some threads to keep the STUCK bit that was set at the end of
"show threads", which confuses the output.
The problem with doing this is that the flag is cleared before the thread
is dumped, and since this flag is used to decide whether to show a backtrace
or not, we no longer get backtraces of stuck threads since the commit
above in 2.7.
This patch just removes the two points where the flag was cleared by the
commit above. It should be backported to 2.7.
Remove ABORT_NOW() on remote unidirectional stream closure. This is
required to ensure our implementation is flexible enough not to fail on
unknown stream types.
Note that for the moment the MAX_STREAMS_UNI flow-control frame is never
emitted. This should be unnecessary for HTTP/3, which has a limited usage
of unidirectional streams, but may be required if other application
protocols are supported in the future.
ABORT_NOW() was triggered by s2n-quic which opens an unknown
unidirectional stream with greasing. This was detected by the QUIC
interop runner for the http3 testcase.
This must be backported up to 2.6.
Use a stream error when possible instead of always closing the whole
connection. This requires a new field <err> in the h3s structure.
Change the decoding loop slightly to facilitate error propagation. It
will be interrupted as soon as <h3s.err> or <h3c.err> is non-null. In
the latter case, a CONNECTION_CLOSE is requested through
qcc_emit_cc_app().
For stream errors, the H3 layer uses qcc_abort_stream_read() coupled with
qcc_reset_stream(). This is in conformance with RFC 9114, which
recommends using STOP_SENDING + RESET_STREAM emission on stream error.
This commit is part of implementing H3 errors at the stream level.
This should be backported up to 2.7.
Implement STOP_SENDING. This is divided into two main functions:
* qcc_abort_stream_read() which can be used by the application protocol
to request a STOP_SENDING. This sets the flag QC_SF_READ_ABORTED.
* qcs_send_reset() is a static function called after the preceding one.
It will send a STOP_SENDING via qcc_send().
QC_SF_READ_ABORTED flag is now properly used: if activated on a stream
during qcc_recv(), the <qcc.app_ops.decode_qcs> callback is skipped. Also,
aborting reads on an unknown unidirectional remote stream is now fully
supported with the emission of a STOP_SENDING, as specified by RFC 9000.
This commit is part of implementing H3 errors at the stream level. This
will allow the H3 layer to request the peer to close its endpoint for an
error on a stream.
This should be backported up to 2.7.
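A standalone model of this two-step scheme (not HAProxy code; names and
the flag value are placeholders echoing the ones above):

  #include <stdio.h>

  #define QC_SF_READ_ABORTED 0x01   /* placeholder value */

  struct qcs_sketch { unsigned int flags; int id; };

  /* counterpart of qcc_abort_stream_read(): only record the intent */
  static void abort_stream_read(struct qcs_sketch *s)
  {
      s->flags |= QC_SF_READ_ABORTED;
  }

  /* counterpart of the send path: emit STOP_SENDING for flagged streams
   * and skip decoding of any further incoming data */
  static void send_pass(struct qcs_sketch *s)
  {
      if (s->flags & QC_SF_READ_ABORTED)
          printf("emit STOP_SENDING for stream %d\n", s->id);
  }

  int main(void)
  {
      struct qcs_sketch s = { .flags = 0, .id = 8 };
      abort_stream_read(&s);   /* e.g. H3 refusing an unknown uni stream */
      send_pass(&s);
      return 0;
  }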
Implement RESET_STREAM reception in mux-quic. On reception, the qcs
instance will be marked as remotely closed and its Rx buffer released.
The stream layer will be flagged on error if still attached.
This commit is part of implementing H3 errors at the stream level.
Indeed, on H3 stream errors, STOP_SENDING + RESET_STREAM should be
emitted. The STOP_SENDING will in turn generate a RESET_STREAM by the
remote peer which will be handled thanks to this patch.
This should be backported up to 2.7.
It is unnecessary to increase a stream's credit once its final size is
known. Indeed, a peer cannot send at a greater offset than the value it
advertised; otherwise, the connection will be closed on STREAM frame
reception with FINAL_SIZE_ERROR.
This commit is a small optimization and may prevent the emission of
unneeded MAX_STREAM_DATA frames on some occasions.
It should be backported up to 2.7.
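The idea can be reduced to this standalone sketch (names and the refill
threshold are arbitrary placeholders):

  #include <stdbool.h>
  #include <stdio.h>

  struct rx_fc_sketch {
      unsigned long long limit;   /* currently advertised max offset */
      bool final_size_known;      /* FIN received: the size is locked */
  };

  static bool need_max_stream_data(const struct rx_fc_sketch *fc,
                                   unsigned long long consumed)
  {
      if (fc->final_size_known)
          return false;                 /* no point in growing the credit */
      return consumed * 2 >= fc->limit; /* arbitrary refill threshold */
  }

  int main(void)
  {
      struct rx_fc_sketch fc = { .limit = 65536, .final_size_known = true };
      printf("emit MAX_STREAM_DATA: %s\n",
             need_max_stream_data(&fc, 60000) ? "yes" : "no");
      return 0;
  }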