Implement the Kathleen Nichols' algorithm used by several congestion control
algorithm implementation (TCP/BBR in Linux kernel, QUIC/BBR in quiche) to track
the maximum value of a data type during a fixe time interval.
In this implementation, counters which are periodically reset are used in place
of timestamps.
Only the max part has been implemented.
(see lib/minmax.c implemenation for Linux kernel).
Add ->initial_wnd new member to quic_cc_path struct to keep the initial value
of the congestion window. This member is initialized as soon as a QUIC connection
is allocated. This modification is required for BBR congestion control algorithm.
Once in a while we find some makefiles ignoring some outdated arguments
and just emit a warning. What's annoying is that if users (say, distro
packagers), have purposely added ERR=1 to their build scripts to make
sure to fail on any warning, these ones will be ignored and the build
can continue with invalid or missing options.
William rightfully suggested that ERR=1 should also catch make's warnings
so this patch implements this, by creating a new "complain" variable that
points either to "error" or "warning" depending on $(ERR), and that is
used to send the messages using $(call $(complain),...). This does the
job right at little effort (tested from GNU make 3.82 to 4.3).
Note that for this purpose the ERR declaration was upped in the makefile
so that it appears before the new errors.mk file is included.
tasklet_wakeup_on() and its derivates (tasklet_wakeup_after() and
tasklet_wakeup()) do not support passing a wakeup cause like
task_wakeup(). This is essentially due to an API limitation cause by
the fact that for a very long time the only reason for waking up was
to process pending I/O. But with the growing complexity of mux tasks,
it is becoming important to be able to skip certain heavy processing
when not strictly needed.
One possibility is to permit the caller of tasklet_wakeup() to pass
flags like task_wakeup(). Instead of going with a complex naming scheme,
let's simply make the flags optional and be zero when not specified. This
means that tasklet_wakeup_on() now takes either 2 or 3 args, and that the
third one is the optional flags to be passed to the callee. Eligible flags
are essentially the non-persistent ones (TASK_F_UEVT* and TASK_WOKEN_*)
which are cleared when the tasklet is executed. This way the handler
will find them in its <state> argument and will be able to distinguish
various causes for the call.
Everything in the tasklet layer supports flags, except that they are
just not implemented in the wakeup functions, while they are in the
task_wakeup functions. Initially it was not considered useful to pass
wakeup causes because these were essentially I/O, but with the growing
number of I/O handlers having to deal with various types of operations
(typically cheap I/O notifications on subscribe vs heavy parsing on
application-level wakeups), it would be nice to start to make this
distinction possible.
This commit extends _tasklet_wakeup_on() and _tasklet_wakeup_after()
to pass a set of flags that continues to be set as zero. For now this
changes nothing, but new functions will come.
Currently tasks being profiled have th_ctx->sched_call_date set to the
current nanosecond in monotonic time. But there's no other way to have
this, despite the scheduler being capable of it. Let's just declare a
new task flag, TASK_F_WANTS_TIME, that makes the scheduler take the time
just before calling the handler. This way, a task that needs nanosecond
resolution on the call date will be able to be called with an up-to-date
date without having to abuse now_mono_time() if not needed. In addition,
if CLOCK_MONOTONIC is not supported (now_mono_time() always returns 0),
the date is set to the most recently known now_ns, which is guaranteed
to be atomic and is only updated once per poll loop.
This date can be more conveniently retrieved using task_mono_time().
This can be useful, e.g. for pacing. The code was slightly adjusted so
as to merge the common parts between the profiling case and this one.
We used to store it in 32-bits since we'd only use it for latency and CPU
usage calculation but usages will evolve so let's not truncate the value
anymore. Now we store the full 64 bits. Note that this doesn't even
increase the storage size due to alignment. The 3 usage places were
verified to still be valid (most were already cast to 32 bits anyway).
To improve debugging, extend "show quic" output to report if pacing is
activated on a connection. Two values will be displayed for pacing :
* a new counter paced_sent_ctr is defined in QCC structure. It will be
incremented each time an emission is interrupted due to pacing.
* pacing engine now saves the number of datagrams sent in the last paced
emission. This will be helpful to ensure burst parameter is valid.
Define a new QUIC congestion algorithm token 'cubic-pacing' for
quic-cc-algo bind keyword. This is identical to default cubic
implementation, except that pacing is used for STREAM frames emission.
This algorithm supports an extra argument to specify a burst size. This
is stored into a new bind_conf member named quic_pacing_burst which can
be reuse to initialize quic path.
Pacing support is still considered experimental. As such, 'cubic-pacing'
can only be used with expose-experimental-directives set.
Support pacing emission for STREAM frames at the QUIC MUX layer. This is
implemented by adding a quic_pacer engine into QCC structure.
The main changes have been written into qcc_io_send(). It now
differentiates cases when some frames have been rejected by transport
layer. This can occur as previously due to congestion or FD buffer full,
which requires subscribing on transport layer. The new case is when
emission has been interrupted due to pacing timing. In this case, QUIC
MUX I/O tasklet is rescheduled to run with the flag TASK_F_USR1.
On tasklet execution, if TASK_F_USR1 is set, all standard processing for
emission and reception is skipped. Instead, a new function
qcc_purge_sending() is called. Its purpose is to retry emission with the
saved STREAM frames list. Either all remaining frames can now be send,
subscribe is done on transport error or tasklet must be rescheduled for
pacing purging.
In the meantime, if tasklet is rescheduled due to other conditions,
TASK_F_USR1 is reset. This will trigger a full regeneration of STREAM
frames. In this case, pacing expiration must be check before calling
qcc_send_frames() to ensure emission is now allowed.
QUIC MUX will be responsible to drive emission with pacing. This will be
implemented via setting TASK_F_USR1 before I/O tasklet wakeup. To
prepare this, encapsulate each I/O tasklet wakeup into a new function
qcc_wakeup().
This commit is purely refactoring prior to pacing implementation into
QUIC MUX.
For STREAM emission, MUX QUIC previously used a local list defined under
qcc_io_send(). This was suitable as either all frames were sent, or
emission must be interrupted due to transport congestion or fatal error.
In the latter case, the list was emptied anyway and a new frame list was
built on future qcc_io_send() invokation.
For pacing, MUX QUIC may have to save the frame list if pacing should be
applied across emission. This is necessary to avoid to unnecessarily
rebuilt stream frame list between each paced emission. To support this,
STREAM list is now stored as a member of QCC structure.
Ensure frame list is always deleted, even on QCC release, using newly
defined utility function qcc_tx_frms_free().
qc_send_mux() has been extended previously to support pacing emission.
This will ensure that no more than one datagram will be emitted during
each invokation. However, to achieve better performance, it may be
necessary to emit a batch of several datagrams one one turn.
A so-called burst value can be specified by the user in the
configuration. However, some congestion control algos may defined their
owned dynamic value. As such, a new CC callback pacing_burst is defined.
quic_cc_default_pacing_burst() can be used for algo without pacing
interaction, such as cubic. It will returns a static value based on user
selected configuration.
Pacing will be implemented for STREAM frames emission. As such,
qc_send_mux() API has been extended to add an argument to a quic_pacer
engine.
If non NULL, engine will be used to pace emission. In short, no more
than one datagram will be emitted for each qc_send_mux() invokation.
Pacer is then notified about the emission and a timer for a future
emission is calculated. qc_send_mux() will return PACING error value, to
inform QUIC MUX layer that it will be responsible to retry emission
after some delay.
Extend quic_pacer engine to support pacing emission. Several functions
are defined.
* quic_pacing_sent_done() to notify engine about an emission of one or
several datagrams
* quic_pacing_expired() to check if emission should be delayed or can be
conducted immediately
This commit is part of a adjustment on QUIC transport send API to
support pacing. Here, qc_send_mux() return type has been changed to use
a new enum quic_tx_err.
This is useful to explain different failure causes of emission. For now,
only two values have been defined : NONE and FATAL. When pacing will be
implemented, a new value would be added to specify that emission was
interrupted on pacing. This won't be a fatal error as this allows to
retry emission but not immediately.
Extend QUIC transport emission function to support a maximum datagram
argument. The purpose is to ensure that qc_send() won't emit more than
the specified value, unless it is 0 which is considered as unlimited.
In qc_prep_pkts(), a counter of built datagram has been added to support
this. The packet building loop is interrupted if it reaches a specified
maximum value. Also, its return value has been changed to the number of
prepared datagrams. This is reused by qc_send() to interrupt its work if
a specified max datagram argument value is reached over one or several
iteration of prepared/sent datagrams.
This change is necessary to support pacing emission. Note that ideally,
the total length in bytes of emitted datagrams should be taken into
account instead of the raw number of datagrams. However, for a first
implementation, it was deemed easier to implement it with the latter.
Add QCC QC_CF_WAIT_FOR_HS and QCS QC_SF_TXBUB_OOB flags to their
respective show_flags to be able to decipher them via dev flags utility.
These values have been added in the current dev version, thus no need to
backport this patch.
When the request is too large to fit in a buffer a 414 or a 431 error
message is returned depending on the error state of the request parser. A
414 is returned if the URI is too long, otherwise a 431 is returned.
This patch should fix the issue #1309.
414-Uri-Too-Long and 431-Request-Header-Fields-Too-Large are now part of
supported status codes that can be define as error files. The hash table
defined in http_get_status_idx() was updated accordingly.
It is now possible to use a log-format string to define the "Set-Cookie"
header value of a response generated by a redirect rule. There is no special
check on the result format and it is not possible during the configuration
parsing. It is proably not a big deal because already existing "set-cookie"
and "clear-cookie" options don't perform any check.
Here is an example:
http-request redirect location https://someurl.com/ set-cookie haproxy="%[var(txn.var)]"
This patch should fix the issue #1784.
On prefix-based redirect, there is an option to drop the query-string of the
location. Here it is the opposite. an option is added to preserve the
query-string of the original URI for a localtion-based redirect.
By setting "keep-query" option, for a location-based redirect only, the
query-string of the original URI is appended to the location. If there is no
query-string, nothing is added (no empty '?'). If there is already a
non-empty query-string on the localtion, the original one is appended with
'&' separator.
This patch should fix issue #2728.
parse_size_err() currently is a function working only on an uint. It's
not convenient for certain elements such as rings on large machines.
This commit addresses this by having one function for uints and one
for ullong, and making parse_size_err() a macro that automatically
calls one or the other. It also has the benefit of automatically
supporting compatible types (long, size_t etc).
From time to time we face a configuration with very small timeouts which
look accidental because there could be expectations that they're expressed
in seconds and not milliseconds.
This commit adds a check for non-nul unitless values smaller than 100
and emits a warning suggesting to append an explicit unit if that was
the intent.
Only the common timeouts, the server check intervals and the resolvers
hold and timeout values were covered for now. All the code needs to be
manually reviewed to verify if it supports emitting warnings.
This may break some configs using "zero-warning", but greps in existing
configs indicate that these are extremely rare and solely intentionally
done during tests. At least even if a user leaves that after a test, it
will be more obvious when reading 10ms that something's probably not
correct.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "4k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, preventing from starting when set
e.g. to "64k". Let's make use of parse_size_err() on it so that units are
supported. This requires to turn it to uint as well, and to explicitly
limit its range to INT_MAX - 2*sizeof(void*), which was previously
partially handled as part of the sign check.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, and
since it's sometimes compared to an int, we limit its range to
0..INT_MAX.
Till now this value was parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on it so that
units are supported. This requires to turn it to uint as well, which
was verified to be OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
Till now these values were parsed as raw integer using atol() and would
silently ignore any trailing suffix, causing unexpected behaviors when
set, e.g. to "512k". Let's make use of parse_size_err() on them so that
units are supported. This requires to turn them to uint as well, which
is OK.
The qcc_report_glitch() function is now replaced with a macro to support
enumerating counters for each individual glitch line. For now this adds
36 such counters. The macro supports an optional description, though that
is not being used for now.
As a reminder, this requires to build with -DDEBUG_GLITCHES=1.
proxy auth_uri struct was manually cleaned up during deinit, but the logic
behind was kind of akward because it was required to find out which ones
were shared or not. Instead, let's switch to a proper refcount mechanism
and free the auth_uri struct directly in proxy_free_common().
COUNT_GLITCH() will implement an unconditional counter on its declaration
line when DEBUG_GLITCHES is set, and do nothing otherwise. The output will
be reported as "GLT" and can be filtered as "glt" on the CLI. The purpose
is to help figure what's happening if some glitches counters start going
through the roof. The macro supports an optional string argument to
describe the cause of the glitch (e.g. "truncated header"), which is then
reported in the dump.
For now this is conditioned by DEBUG_GLITCHES but if it turns out to be
light enough, maybe we'll keep it enabled full time. In this case it
might have to be moved away from debug dev, or at least documented (or
done as debug counters maybe so that dev can remain undocumented and
updatable within a branch?).
In order to count new event types, we'll need to support empty conditions
so that we don't have to fake if (1) that would pollute the output. This
change checks if #cond is an empty string before concatenating it with
the optional var args, and avoids dumping the colon on the dump if the
whole description is empty.
After master-worker refactoring, master performs re-exec only once up to
receiving "reload" command or USR2 signal. There is no more the second
master's re-exec to free unused memory. Thus, there is no longer need to export
environment variable HAPROXY_LOAD_SUCCESS with worker process load status. This
status can be simply saved in a global variable load_status.
Since 3.0, it is possible to assign a GUID to proxies, listeners and
servers. These objects are stored in a global tree guid_tree.
Proxies and listeners are static. However, servers may be added or
deleted at runtime, which imply that guid_tree must be protected. Fix
this by declaring a read-write lock to protect tree access.
For now, only guid_insert() and guid_remove() are protected using a
write lock. Outside of these, GUID tree is not accessed at runtime. If
server CLI commands are extended to support GUID as server identifier,
lookup operation should be extended with a read lock protection.
Note that during stat-file preloading, GUID tree is accessed for lookup.
However, as it is performed on startup which is single threaded, there
is no need for lock here. A BUG_ON() has been added to ensure this
precondition remains true.
This bug could caused a segfault when using dynamic servers with GUID.
However, it was never reproduced for now.
This must be backported up to 3.0. To avoid a conflict issue, the
previous cleanup patch can be merged before it.
event_hdl_sub_list_empty() may be used to know if the subscription list
passed as argument is empty or not (ie: if there currently are any
subcribers or not). It can be useful to know if the subscription is empty
is order to avoid unecessary preparation work and skip event publishing to
save CPU time if we already know that no one is interested in tracking the
changes for a given subscription list.
In order to help users detect when threads are behaving abnormally, let's
try to emit a warning when one is no longer making any progress. This will
allow to catch faulty situations more accurately, instead of occasionally
triggering just after the long task. It will also let users know that there
is something wrong with their configuration, and inspect the call trace to
figure whether they're using excessively long rules or Lua for example (the
usual warnings about lua-load vs lua-load-per-thread are still reported).
The warning will only be emitted for threads not yet marked as stuck so
as not to interfere with panic dumps and avoid sending a warning just
before a panic. A tainted flag is set when this happens however (0x2000).
There's currently no way to just emit a warning informing that a thread
is stuck without crashing. This is a problem because sometimes users
would benefit from this info to clean up their configuration (e.g. abuse
of map_regm, lua-load etc).
This commit adds a new function ha_stuck_warning() that will emit a
warning indicating that the designated thread has been stuck for XX
milliseconds, with a number of streams blocked, and will make that
thread dump its own state. The warning will then be sent to stderr,
along with some reminders about the impacts of such situations to
encourage users to fix their configuration.
In order not to disrupt operations, a local 4kB buffer is allocated
in the stack. This should be quite sufficient.
For now the function is not used.
The comment asks to update the "metrics_info" array, which does not
exist, instead it's called stat_cols_info[] and is in stats.c. Let's
mention all that to save time searching for the needed info.
While no version seems to have ever known that "metrics_info", it's not
needed to backport this as it's only a comment.
A ClientHello may be splitted accross several different CRYPTO frames,
then mixed in a single QUIC packet. This is used notably by clients such
as chrome to render the first Initial packet opaque to middleboxes.
Each packet frame is handled sequentially. Out-of-order CRYPTO frames
are buffered in a ncbuf, until gaps are filled and data is transferred
to the SSL stack. If CRYPTO frames are heavily splitted with small
fragments, buffering may fail as ncbuf does not support small gaps. This
causes the whole packet to be rejected and unacknowledged. It could be
solved if the client reemits its ClientHello after remixing its CRYPTO
frames.
This patch is written to improve CRYPTO frame parsing. Each CRYPTO
frames which cannot be buffered due to ncbuf limitation are now stored
in a temporary list. Packet parsing is completed until all frames have
been handled. If temporary list is not empty, reparsing is done on the
stored frames. With the newly buffered CRYPTO frames, ncbuf insert
operation may this time succeeds if the frame now covers a whole gap.
Reparsing will loop until either no progress can be made or it has been
done at least 3 times, to prevent CPU utilization.
This patch should fix github issue #2776.
This should be backported up to 2.6, after a period of observation. Note
that it relies on the following refactor patches :
MINOR: quic: extend return value of CRYPTO parsing
MINOR: quic: use dynamically allocated frame on parsing
MINOR: quic: simplify qc_parse_pkt_frms() return path
qc_handle_crypto_frm() is the function used to handled a newly received
CRYPTO frame. Change its API to use a newly dedicated return type. This
allows to report if the frame was properly handled, ignored if already
parsed previously or rejected after a fatal error.
This commit does not have any functional changes. However, it allows to
simplify qc_handle_crypto_frm() API by removing <fast_retrans> as output
parameter. Also, this patch will be necessary to support multiple
iteration of packet parsing for CRYPTO frames.