The fallback macros for when stdio is not there didn't have the "..."
and were causing build issues on platforms with stricter dependencies
between includes.
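As a rough illustration of the problem and fix (macro and function names here are hypothetical, not the exact ones from the tree), the empty fallback must be variadic as well so that any call still compiles when stdio is absent:
    #ifdef EOF  /* provided by stdio.h */
    #define SHOW_FLAGS(buf, len, ...) snprintf(buf, len, __VA_ARGS__)
    #else
    /* the "..." is required here too, otherwise calls passing extra
     * arguments fail to expand when stdio was not included first */
    #define SHOW_FLAGS(buf, len, ...) do { } while (0)
    #endif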
This patch is a prerequisite for #1626.
Adding PAUSED state to the list of available proxy states.
The flag is set when the proxy is paused at runtime (pause_listener()).
It is cleared when the proxy is resumed (resume_listener()).
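A minimal sketch of the intent (the exact flag name below is an assumption, not taken from the patch):
    /* in pause_listener(): mark the proxy as paused at runtime */
    px->flags |= PR_FL_PAUSED;

    /* in resume_listener(): clear it once the proxy is resumed */
    px->flags &= ~PR_FL_PAUSED;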
It should be backported to 2.6, 2.5 and 2.4
A minor API change was performed in listener(.c/.h) to restore consistency
between stop_listener() and (resume/pause)_listener() functions.
LISTENER_LOCK was never locked prior to calling stop_listener(): the lli
variable hint is thus not useful anymore.
Added PROXY_LOCK locking in the (resume/pause)_listener() functions
with the related lpx variable hint (prerequisite for #1626).
It should be backported to 2.6, 2.5 and 2.4
The new function is strm_et_show_flags(). Only the error type is
handled at the moment, as a bit more complex logic is needed to
mix the values and enums present in some fields.
The two new functions are se_show_flags() and sc_show_flags().
Maybe something could be done for SC_ST_* values but as it's a
small enum, a simple switch/case should work fine.
The two new functions are chn_show_analysers() and chn_show_flags().
They work on an existing buffer so one was declared in flags.c for this
purpose. File flags.c does not have to know about channel flags anymore.
Some of our flags have enums inside a mask. The new macro __APPEND_ENUM
is able to deal with that by comparing the flag's value against an exact
one under the mask. One needs to take care of eliminating the zero value
though, otherwise delimiters will not always be properly placed (e.g. if
some flags were dumped before and what remains is exactly zero). The
bits of the mask are cleared only upon exact matches.
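Conceptually the core test looks like the sketch below (a simplified illustration, not the exact macro): the name is appended only on an exact match of the value under the mask, and only then are the mask bits cleared so that unknown values remain visible in the epilogue.
    if ((flg & msk) == val && val != 0) {
            flg &= ~msk;
            /* append the delimiter (if needed) and the enum's name to buf */
    }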
The "flags" utility is useful but painful to maintain up to date. This
commit aims at providing a low-maintenance solution to keep flags up to
date, by proposing some macros that build a string from a set of flags
in a way that requires the least possible verbosity.
The idea will be to add an inline function dedicated to this just after
the flags declaration, and enumerate the flags one is interested in, and
that function will fill a string based on them.
Placing this inside the type files allows both haproxy and external tools
like "flags" to use it, but comes with a few constraints. First, the
files will be slightly less readable if these functions are huge, so they
need to stay as compact as possible. Second, the function will need
snprintf() and we don't want to include stdio.h in type files as it
proved to be particularly heavy and to cause definition headaches in
the past.
As such the file here only contains a macro enclosed in #ifdef EOF (that
is defined in stdio), and provides an alternate empty one when no stdio
is defined. This way it's the caller that has to include stdio first or
it won't get anything back, and in practice the locations relying on
this always have it.
The macro has to be used in 3 steps:
  - prologue: dumps 0 and exits if the value is zero
  - flags: the macro can be recursively called and it will push the
    flag from bottom to top so that they appear in the same order as
    today without requiring to be declared the other way around
  - epilogue: dump remaining flags that were not identified
The macro was arranged so that a single character can be used with no
other argument to declare all flags at once. Example:
    #define _(n, ...) __APPEND_FLAG(buf, len, del, flg, n, #n, __VA_ARGS__)
    _(0);
    _(X_FLAG1, _(X_FLAG2, _(X_FLAG3, _(X_FLAG4))));
    _(~0);
    #undef _
Existing files will have to be updated to rely on it, and more files
could come soon.
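For illustration, a dedicated inline function built on this pattern could look like the hypothetical sketch below (the function name, flag names and exact signature are made up); stdio.h must already be included by the caller for the macro to expand to something useful:
    static inline const char *x_show_flags(char *buf, size_t len,
                                           const char *del, unsigned int flg)
    {
    #define _(n, ...) __APPEND_FLAG(buf, len, del, flg, n, #n, __VA_ARGS__)
            /* prologue */
            _(0);
            /* flags */
            _(X_FLAG1, _(X_FLAG2, _(X_FLAG3, _(X_FLAG4))));
            /* epilogue */
            _(~0);
    #undef _
            return buf;
    }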
This is the ->finalize application callback which prepares the unidirectional STREAM
frames for the h3 settings and wakes up the mux I/O handler to send them. As haproxy
is at the same time always waiting for the client request, this makes haproxy
call sendto() to send only about 20 bytes of stream data. Furthermore, in case
of heavy loss, this gives short h3 requests fewer chances to succeed.
Drawback: as the mux currently sends its streams in ascending ID order, stream 0
is always embedded before the unidirectional stream 3 which carries the h3 settings.
Nevertheless, as these settings may be lost and received after other h3 request
streams, this is permitted by the RFC.
Perhaps there is a better way to do this. This will have to be checked with Amaury.
Must be backported to 2.6.
It is possible to speed up the handshake completion, but only once per
connection, as mentioned in RFC 9002 "6.2.3. Speeding up Handshake Completion".
Add a flag to prevent this process from being run several times
(see https://www.rfc-editor.org/rfc/rfc9002#name-speeding-up-handshake-compl).
Must be backported to 2.6.
This removes all the hard-coded 8-bit and 256 entries to use a pair of
macros instead so that we can more easily experiment with larger table
sizes if needed.
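In spirit, the change replaces the literal constants with something along these lines (macro names here are only illustrative):
    #define HASH_BUCKETS_BITS 8                         /* was hard-coded "8" */
    #define HASH_BUCKETS      (1U << HASH_BUCKETS_BITS) /* was hard-coded "256" */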
There used to be one tid for tasklets and a thread_mask for tasks. Since
2.7, both tasks and tasklets now use a tid (albeit with a very slight
semantic difference for the negative value), so in order to limit code
duplication and to ease debugging it makes sense to move tid into the
common part. One limitation is that it will leave a hole in the structure,
but we now have the wake_date that is always present and can move there as
well to plug the hole.
This results in something overall pretty clean (and cleaner than before),
with the low-level stuff (state,tid,process,context) appearing first, then
the caller stuff (caller,wake_date,calls,debug) next, and finally the
type-specific stuff (rq/wq/expire/nice).
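The resulting ordering can be pictured roughly as follows (a simplified sketch with approximate types, not the actual declaration):
    struct task /* simplified sketch */ {
            /* low-level part, common with tasklets */
            unsigned int state;
            int tid;
            struct task *(*process)(struct task *t, void *ctx, unsigned int state);
            void *context;
            /* caller part */
            const void *caller;
            uint32_t wake_date;
            uint32_t calls;
            /* debug section when enabled, then the type-specific part:
             * rq/wq/expire/nice for tasks */
    };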
Instead of storing an index that's swapped at every call, let's use the
two pointers as a shifting history. Now we have a permanent "caller"
field that records the last caller, and an optional prev_caller in the
debug section enabled by DEBUG_TASK that keeps a copy of the previous
caller. This way, not only is it much easier to follow what's
happening during debugging, but it saves 8 bytes in the struct task in
debug mode and still keeps it under 2 cache lines in nominal mode, and
this will finally be usable everywhere and later in profiling.
The caller_idx was also used as a hint that the entry was freed, in order
to detect wakeup-after-free. This was changed by setting caller to -1
instead and preserving its value in caller[1].
Finally, the operations were made atomic. That's not critical but since
it's used for debugging and race conditions represent a significant part
of the issues in multi-threaded mode, it seems wise to at least eliminate
some possible factors of faulty analysis.
appctx_wakeup() relies on task_wakeup(), but since it calls it from a
function, the calling place is always appctx_wakeup() itself, which is
not very useful.
Let's turn it to a macro so that we can log the location of the caller
instead. As an example, the cli_io_handler() which used to be seen as
this:
(gdb) p *appctx->t.debug.caller[0]
$10 = {
  func = 0x9ffb78 <__func__.37996> "appctx_wakeup",
  file = 0x9b336a "include/haproxy/applet.h",
  line = 110,
  what = 1 '\001',
  arg8 = 0 '\000',
  arg32 = 0
}
Now shows the more useful:
(gdb) p *appctx->t.debug.caller[0]
$6 = {
  func = 0x9ffe80 <__func__.38641> "sc_app_chk_snd_applet",
  file = 0xa00320 "src/stconn.c",
  line = 996,
  what = 6 '\006',
  arg8 = 0 '\000',
  arg32 = 0
}
This reduces the task struct by 8 bytes, reduces the code size a little
bit by simplifying the calling convention (one argument dropped), and
as a bonus provides the function name in the caller.
The memstats code currently defines its own file/function/line number,
type and extra pointer. We don't need to keep them separate and we can
easily replace them all with just a struct ha_caller. Note that the
extra pointer could be converted to a pool ID stored into arg8 or
arg32 and be dropped as well, but this would first require to define
IDs for pools (which we currently do not have).
The purpose of this structure is to assemble all constant parts of a
generic calling point for a specific event. These ones are created by
the compiler as a static const element outside of the code path, so
they cost nothing in terms of CPU, and a pointer to that descriptor
can be passed to the place that needs it. This is very similar to what
is being done for the mem_stat stuff.
This will be useful to simplify and improve DEBUG_TASK.
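Based on the fields visible in the gdb dumps above, such a descriptor could look roughly like this (exact field types are an assumption):
    struct ha_caller {
            const char *func;   /* function where the event happens */
            const char *file;   /* source file */
            uint16_t line;      /* source line */
            uint8_t what;       /* type of event (wakeup, queue, ...) */
            uint8_t arg8;       /* optional 8-bit argument */
            uint32_t arg32;     /* optional 32-bit argument */
    };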
There are a few places where it's convenient to hash a pointer to compute
a statistics bucket. Here we're basically reusing the hash that was used
by memory profiling, with a minor update: the multiplier was corrected to
be prime and to stand by its promise of having an equal number of ones and
zeros, and 32-bit platforms won't lose range anymore.
A two-pointer variant was also added.
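A sketch of the idea (the constant and names below are illustrative, not the exact ones from the tree): multiply the pointer by a large odd constant with a balanced bit pattern, then keep the top bits as the bucket index.
    static inline unsigned int ptr_hash_sketch(const void *p, unsigned int bits)
    {
            /* bits is expected to be between 1 and 32; a two-pointer variant
             * would simply xor both pointers before the multiplication */
            unsigned long long x = (unsigned long)p;

            x *= 0x9e3779b97f4a7c15ULL; /* hypothetical mixing constant */
            return x >> (64 - bits);
    }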
It was a mistake to put these two fields in the struct task. This
was added in 1.9 via commit 9efd7456e ("MEDIUM: tasks: collect per-task
CPU time and latency"). These fields are used solely by streams in
order to report the measurements via the lat_ns* and cpu_ns* sample
fetch functions when task profiling is enabled. For the rest of the
tasks, this is pure CPU waste when profiling is enabled, and memory
waste 100% of the time, as the point where these latencies and usages
are measured is in the profiling array.
Let's move the fields to the stream instead, and have process_stream()
retrieve the relevant info from the thread's context.
The struct task is now back to 120 bytes, i.e. almost two cache lines,
with 32 bits still available.
The profile entry that corresponds to the current task/tasklet being
profiled is now stored into the thread's context. This will allow it
to be accessed from the tasks themselves. This is needed for an upcoming
fix.
When task profiling is enabled, the scheduler can measure and report
the cumulated time spent in each task and their respective latencies. But
this was wrong for tasks with few wakeups as well as for self-waking ones,
because the call date needed to measure how long it takes to process the
task is retrieved in the task itself (->wake_date was turned to the call
date), and we could face two conditions:
- a new wakeup while the task is executing would reset the ->wake_date
field before returning, causing abnormally low values to be reported;
that was likely the case for task_run_applet for self-waking applets;
- when the task dies, NULL is returned and the call date couldn't be
retrieved, so that CPU time was not being accounted for. This was
particularly visible with process_stream() which is usually called
only twice per request, and whose time was systematically halved.
The cleanest solution here is to keep in mind that the scheduler already
uses quite a bit of local context in th_ctx, and place the intermediary
values there so that they cannot vanish. The wake_date has to be reset
immediately once read, and only its copy is used throughout the function. Note
that this must be done both for tasks and tasklets, and that until recently
tasklets were also able to report wrong values due to their sole dependency
on TH_FL_TASK_PROFILING between tests.
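The pattern boils down to something like the following sketch (the field name on the thread context is an assumption for illustration):
    /* in the scheduler, just before calling the task/tasklet */
    uint32_t wake_date = t->wake_date;

    t->wake_date = 0;                    /* reset immediately once read */
    th_ctx->sched_wake_date = wake_date; /* copy that cannot vanish */
    /* latency/CPU accounting later relies on th_ctx->sched_wake_date,
     * even if the task rewrites ->wake_date or dies in its handler */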
One nice benefit for future improvements is that such information will now
be available from the task without having to be stored into the task itself
anymore.
Since the tasklet part was computed with wrapping 32-bit arithmetic and
the task one was on 64-bit, the values have now been consistently moved to
32 bits as it's already largely sufficient (4s spent in a task is more
than twice what the watchdog would tolerate). Some further cleanups might
be necessary, but the patch aimed at staying minimal.
Task profiling output after 1 million HTTP requests previously looked like
this:
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2012338    4.850s   2.410us    12.91s   6.417us
    process_stream              2000136    9.594s   4.796us    34.26s   17.13us
    sc_conn_io_cb               2000135    1.973s   986.0ns    30.24s   15.12us
    h1_timeout_task                 137         -         -   2.649ms   19.34us
    accept_queue_process             49   152.3us   3.107us   321.7yr   6.564yr
    main+0x146430                     7   5.250us   750.0ns   25.92us   3.702us
    srv_cleanup_idle_conns            1   559.0ns   559.0ns   918.0ns   918.0ns
    task_run_applet                   1         -         -   2.162us   2.162us
Now it looks like this:
  Tasks activity:
    function                      calls   cpu_tot   cpu_avg   lat_tot   lat_avg
    h1_io_cb                    2014194    4.794s   2.380us    13.75s   6.826us
    process_stream              2000151    20.01s   10.00us    36.04s   18.02us
    sc_conn_io_cb               2000148    2.167s   1.083us    32.27s   16.13us
    h1_timeout_task                 198   54.24us   273.0ns   3.487ms   17.61us
    accept_queue_process             52   158.3us   3.044us   409.9us   7.882us
    main+0x1466e0                    18   16.77us   931.0ns   63.98us   3.554us
    srv_cleanup_toremove_conns        8   282.1us   35.26us   546.8us   68.35us
    srv_cleanup_idle_conns            3   149.2us   49.73us   8.131us   2.710us
    task_run_applet                   3   268.1us   89.38us   11.61us   3.871us
Note the two-fold difference on process_stream().
This feature is essentially used for debugging so it has extremely limited
impact. However it's used quite a bit more in bug reports and it would be
desirable that at least 2.6 gets this fix backported. It depends on at least
these two previous patches which will then also have to be backported:
MINOR: task: permanently enable latency measurement on tasklets
CLEANUP: task: rename ->call_date to ->wake_date
This field is misnamed because its real and important content is the
date the task was woken up, not the date it was called. It temporarily
holds the call date during execution but this remains confusing. In
fact before the latency measurements were possible it was indeed a call
date. Thus it will now be called wake_date.
This change is necessary because a subsequent fix will require the
introduction of the real call date in the thread ctx.
When tasklet latency measurement was enabled in 2.4 with commit b2285de04
("MINOR: tasks: also compute the tasklet latency when DEBUG_TASK is set"),
the feature was conditioned on DEBUG_TASK because the field would add 8
bytes to the struct tasklet.
This approach was not a very good idea because the struct ends on an int
anyway, thus it finishes with a 32-bit hole regardless of the presence
of this field. What is true however is that adding it turned a 64-byte
struct into a 72-byte one when caller debugging is enabled.
This patch revisits this with a minor change. Now only the lowest 32
bits of the call date are stored, so they always fit in the remaining
hole, and this allows the dependency on DEBUG_TASK to be removed. With
debugging off, we're now seeing a 48-byte struct, and with debugging
on it's exactly 64 bytes, thus still exactly one cache line. 32 bits
allow a latency of 4 seconds on a tasklet, which already indicates a
completely dead process, so there's no point storing the upper bits at
all. And even in the event it would happen once in a while, the lost
upper bits do not really add any value to the debug reports. Also, now
one tasklet wakeup every 4 billion will not be sampled due to the test
on the value itself. Similarly we just don't care, it's statistics and
the measurements are not 9-digit accurate anyway.
There's a subtle (harmless) bug in task_instant_wakeup(). As it uses
some tasklet code instead of some task code, the debug part also acts
on the tasklet equivalent, and the call_date is only set when DEBUG_TASK
is set instead of unconditionally like with tasks. As such, without this
debugging macro, call dates are not updated for tasks woken this way.
There isn't any impact yet because this function was introduced in 2.6 to
solve certain classes of issues and is not used yet, and in the worst case
it would only affect the reported latency time.
This may be backported to 2.6 in case a future fix would depend on it but
currently will not fix existing code.
The tasklet's call date was not reset, so if profiling was enabled while
some tasklets were in the run queue, their initial random value could be
used to preload a bogus initial latency value into the task profiling bin.
Let's just zero the initial value.
This should be backported to 2.4 as it was brought with initial commit
b2285de04 ("MINOR: tasks: also compute the tasklet latency when DEBUG_TASK
is set"). The impact is very low though.
To work, quic_pin_cid_to_tid() must set cid[0] to a value whose modulo
<global.nbthread> equals <target_tid>. For any integer n, (n - (n % m)) + d
always has d as modulo m (with d < m).
So, this statement seemed correct:
    cid[0] = cid[0] - (cid[0] % global.nbthread) + target_tid;
except when n wraps or when another modulo is applied to the addition result.
Here, with 8-bit modular arithmetic, if m does not divide 256, this cannot
work for values which wrap when incremented by d.
For instance with n=255, m=3 and d=1 the formula gives 0 (it should give d).
To fix this, we first limit cid[0] to 255 - <target_tid> to prevent cid[0] from wrapping.
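The fixed computation thus looks like this sketch, following the description above (variable names assumed from the statement quoted earlier):
    /* prevent the 8-bit wrap: after this, cid[0] + target_tid <= 255 */
    if (cid[0] > 255 - target_tid)
            cid[0] = 255 - target_tid;
    cid[0] = cid[0] - (cid[0] % global.nbthread) + target_tid;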
Thank you to @esb for having reported this issue in GH #1855.
Must be backported to 2.6
LibreSSL does not implement EVP_chacha20_poly1305() with EVP_CIPHER but
uses the EVP_AEAD API instead:
https://man.openbsd.org/EVP_AEAD_CTX_init
This patch disables this cipher for LibreSSL for now.
When building HAProxy with USE_QUIC and libressl 3.6.0, the
ssl_sock_switchtx_cbk symbol is not found because libressl does not
implement the client_hello_cb.
An ssl_sock_switchtx_cbk version for the servername callback is available
but wasn't exported correctly.
This helper will be called for muxes that provide it and will be used
to let the mux provide extra information about the stream attached to
a stream descriptor. A line prefix is passed in argument so that the
mux is free to break long lines without breaking indent. No prefix
means no line breaks should be produced (e.g. for short dumps).
Some recent traces started to show confusing stream pointers ending with
0xe. The reason was that the stream's obj_type was almost unused in the
past and was stuffed in a hole in the structure. But now it's present in
all "show sess all" outputs and having to mentally match this value against
another one that's 0x17e lower is painful. The solution here is to move the
obj_type at the top, like in almost every other structure, but without
breaking the efficient layout.
This patch moves a few fields around and manages to both plug some holes
(16 bytes saved, 976 to 960) and avoid channels needlessly crossing cache
boundaries (res was spread over 3 lines vs 2 now).
Nothing else was changed. It would be desirable to backport this to 2.6
since it's where dumps are currently being processed the most.
As outlined in commit f7ebe584d7 ("BUILD: debug: Add braces to if
statement calling only CHECK_IF()"), the BUG_ON() family of macros
is incorrectly defined to be empty when debugging is disabled, and
that can lead to trouble. Make sure they always fall back to the
usual "do { } while (0)". This may be backported to 2.6 if needed,
though no such issue was met there to date.
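In other words, when debugging is disabled the macros should still expand to a valid statement, along these lines (a simplified illustration, not the exact definitions):
    /* so that constructs like "if (cond) BUG_ON(x); else ..." keep
     * compiling and behaving the same with debugging disabled */
    #define BUG_ON(cond)   do { } while (0)
    #define CHECK_IF(cond) do { } while (0)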
This bug arrived with this commit:
"MINOR: quic: Add reusable cipher contexts for header protection"
haproxy could crash because of missing cipher context initializations for
the header protection and draft-v2 Initial secrets. This was due to the fact
that these initializations, both for RX and TX secrets, were done outside of
qc_new_isecs(). The role of this function is definitively to initialize these
cipher contexts in addition to the derived secrets. Indeed this function is called
by qc_new_conn() which initializes the connection, but also by qc_conn_finalize()
which also calls qc_new_isecs() in case a different QUIC version was negotiated
by the peers from the one used by the client for its first Initial packet.
This was reported by "v2" QUIC interop test with at least picoquic as client.
Must be backported to 2.6.
Shut the connect() warning of resolvers_finalize_config() when the
configuration was not emitted manually.
This shuts the warning for the "default" resolvers which is created
automatically for the httpclient.
Must be backported in 2.6.
It is only a real problem for agent-checks when there is no agent string to
send. The condition to disable TCP_QUICKACK was only based on the action
type following the connect one. But it is not always accurate. Indeed, for
agent-checks, there is always a SEND action. But if there is no "agent-send"
string defined, nothing is sent. In this case, this adds 200ms of latency
for no reason.
To fix the bug, a flag is now used on the CONNECT action to indicate that
data should be sent after the connect. For health-checks, this flag
is set if the action following the connect is a SEND action. For
agent-checks, it is set if an "agent-send" string is defined.
This patch should fix the issue #1836. It must be backported as far as 2.2.
Replace the ->rx.pqpkts quic_enc_level struct member MT_LIST by a LIST.
Same thing for the ->list quic_rx_packet struct member MT_LIST.
Update the code accordingly. This was a leftover from the multithreading
support (several threads per connection).
Must be backported to 2.6
Some clients send CONNECTION_CLOSE frame without acknowledging the STREAM
data haproxy has sent. In this case, when closing the connection if
there were remaining data in QUIC stream buffers, they were not released.
Add a <closing> boolean option to qc_stream_desc_free() to force the
stream buffer memory releasing upon closing connection.
Thank you to Tristan for having reported such a memory leak issue in GH #1801.
Must be backported to 2.6.
In ticket #1805 a user is impacted by the limitation of the size of the CLI
buffer when updating a ca-file.
This patch allows a user to append new certificates to a ca-file instead
of trying to put them all with "set ssl ca-file"
The implementation uses a new function ssl_store_dup_cafile_entry() which
duplicates a cafile_entry and its X509_STORE.
ssl_store_load_ca_from_buf() was modified to take an append parameter so
that the function could be shared between "set" and "add".
In order to be able to append new CA in a cafile_entry,
ssl_store_load_ca_from_buf() was reworked and a "append" parameter was
added.
The function is able to keep the previous X509_STORE which was already
present in the cafile_entry.
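Conceptually, the "append" path can be pictured like this (a simplified sketch using standard OpenSSL calls, not the actual function body; <bio> and <store> are assumed names): each PEM certificate from the payload is parsed and added to the entry's existing X509_STORE instead of rebuilding it from scratch.
    X509 *ca;

    while ((ca = PEM_read_bio_X509(bio, NULL, NULL, NULL)) != NULL) {
            /* keep the previously loaded CAs and just add the new one;
             * X509_STORE_add_cert() takes its own reference */
            if (!X509_STORE_add_cert(store, ca)) {
                    X509_free(ca);
                    break;
            }
            X509_free(ca);
    }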
Implement quic_tls_rx_hp_ctx_init() and quic_tls_tx_hp_ctx_init() to initialize
such header protection cipher contexts for each RX and TX part and for each
packet number space, only once per connection.
Make qc_new_isecs() call these two functions to initialize the cipher contexts
of the Initial secrets. Same thing for ha_quic_set_encryption_secrets() to
initialize the cipher contexts of the subsequent derived secrets (0RTT, 1RTT,
Handshake).
Modify qc_do_rm_hp() and quic_apply_header_protection() to reuse these
cipher contexts.
Note that there is no need to modify the key update for the header protection.
The header protection secrets are never updated.