As mentioned in b40d804 ("MINOR: sink: add some comments about sft->appctx
usage in applet handlers"), there are few places in the code where it
looks like we assumed that the applet callbacks such as
sink_forward_session_init() or sink_forward_io_handler() could be
executing an appctx whose sft is detached from the appctx
(appctx != sft->appctx).
In practise this should not be happening since an appctx sticks to the
same thread its entire lifetime, and the only times sft->appctx is
effectively assigned is during the session/appctx creation (in
process_sink_forward()) or release.
Thus if sft->appctx wouldn't point to the appctx that the sft was bound
to after appctx creation, it would probably indicate a bug rather than
an expected condition. To further emphasize that and prevent the
confusion, and since 3.1-dev4 was released, let's remove such checks and
instead add a BUG_ON to ensure this never happens.
In _sink_forward_io_handler(), the "hard_close" label was removed since
there are no more uses for it (no hard errors may be caught from the
function for now)
With AWS-LC, the aead part is covered by the EVP_AEAD API which
provides the correct EVP_aead_chacha20_poly1305(), however for header
protection it does not provides an EVP_CIPHER for chacha20.
This patch implements exceptions in the header protection code and use
EVP_CIPHER_CHACHA20 and EVP_CIPHER_CTX_CHACHA20 placeholders so we can
use the CRYPTO_chacha_20() primitive manually instead of the EVP_CIPHER
API.
This requires to check if we are using EVP_CIPHER_CTX_CHACHA20 when
doing EVP_CIPHER_CTX_free().
In order to prepare the code for using Chacha20 with the EVP_AEAD API,
both quic_tls_hp_decrypt() and quic_tls_hp_encrypt() need an extra key
argument.
Indeed Chacha20 does not exists as an EVP_CIPHER in AWS-LC, so the key
won't be embedded into the EVP_CIPHER_CTX, so we need an extra parameter
to use it.
Some of the crypto functions used for headers protection in QUIC are
named with an "aes" name even thought they are not used for AES
encryption only.
This patch renames these "aes" to "hp" so it is clearer.
The QUIC crypto is using the EVP_CIPHER API in order to achieve
authenticated encryption, this was the API which was used with OpenSSL.
With libraries that inspires from BoringSSL (libreSSL and AWS-LC), the
AEAD algorithms are implemented using the EVP_AEAD API.
This patch converts the call to the EVP_CIPHER API when called in the
contex of AEAD cryptography for QUIC.
The patch defines some QUIC_AEAD macros that can be either EVP_CIPHER or
EVP_AEAD depending on the library.
This was mainly done for AWS-LC but this could be useful for other
libraries. This should finally allow to use CHACHA20_POLY1305 with
AWS-LC.
This patch allows to use the following ciphers with the EVP_AEAD API:
- TLS1_3_CK_AES_128_GCM_SHA256
- TLS1_3_CK_AES_256_GCM_SHA384
AWS-LC does not implement TLS1_3_CK_AES_128_CCM_SHA256 and
TLS1_3_CK_CHACHA20_POLY1305_SHA256 requires some hack for headers
protection which will come in another patch.
K cubic variable is stored in ms. But it was a formula with the second as unit
for the window difference parameter which was used to compute K without
considering the loss of information. Then the result was converted in ms (K *= 1000).
This leaded to a lack of precision and multiples of 1000 as values.
To fix this, use the same formula but with the window difference in ms as parameter
passed to the cubic function and remove the conversion.
Must be backported as far as 2.6.
In c454296f0 ("OPTIM: sink: balance applets accross threads"), we already
made sure to balance applets accross threads by picking a random thread
to spawn the new applet.
Also, thanks to the previous commit, we also have the ability to destroy
the applet when a certain amount of messages were processed to help
distribute the load during runtime.
Let's improve that by trying up to 3 different threads in the hope to pick
a non-overloaded one in the best scenario, and the least over loaded one
in the worst case. This should help to better distribute the load over
multiple threads when high loads are expected.
Logic was greatly inspired from thread migration logic used by server
health checks, but it was simpliflied for sink's use case.
Thanks to the previous commit, it is now possible to know how many events
were processed for a given sft/server sink pair. As mentioned in commit
c454296 ("OPTIM: sink: balance applets accross threads"), let's provide
the ability to restart a server connection when a certain amount of events
were processed to help better balance the load over multiple threads.
For this, we make use the of "max-reuse" server keyword which was only
relevant under "http" context so far. Under sink context, "max-reuse"
corresponds to the number of times the tcp connection can be reused
for sending messages, which in fact means that "max-reuse + 1" is the
number of events (ie: messages) that are allowed to be sent using the
same tcp server connection: when this threshold is met, the connection
will be destroyed and a new one will be created on a random thread.
The value is not strict: it is the minimum value above which the
connection may be destroyed since the value is checked after
ring_dispatch_messages() which may process multiple messages at once.
By default, no limit is enforced (the connection will be reused for as
long as it is available).
The documentation was updated accordingly.
Add a new struct member to sft structure named e_processed in order to
track the total number of events processed by sft applets.
sink_forward_oc_io_handler() and sink_forward_io_handler() now make use
of ring_dispatch_messages() optional value added in the previous commit
in order to increase the number of processed events.
ring_dispatch_messages() now takes an optional argument <processed> which
must point to a size_t counter when provided.
When provided, the value is updated to the number of messages processed
by the function.
Given that sink applets are responsible for conveying messages from the
ring to the tcp server endpoint, there are no protocol timeout or errors
expected there, it is an unidirectional flow of data over TCP.
As such, NOLINGER flag which was inherited from peers applet, see
dbd026792 ("BUG/MEDIUM: peers: set NOLINGER on the outgoing stream
interface") is not desirable under sink context:
The reason why we have the NOLINGER flag set is to ensure the connection
is closed right away and avoid 60s TIME_WAIT delay on closed sockets.
The downside is that messages sent right before closing the socket are
not guaranteed to make it to the server because closing with NOLINGER
flag set will result in RST packet being emitted right away, which could
prevent in-flight messages from being properly delivered.
Unlike peers applets, the only cases were sink applets are expected to
close the connection are upon unexpected error or upon stopping, which are
relatively rare events. Thanks to previous commit, ERROR flag is already
set in case of error, so the use of NOLINGER is not mandatory for the
RST to be sent. Now for the stopping case, it only happens once in the
process lifetime so it's acceptable to close the socket using EOS+EOI
flags without the NOLINGER option set.
So in our case, it is preferable to ensure messages get properly delivered
knowning that closed sockets should be piling up in TIME_WAIT, this means
removing the NOLINGER flag on the outgoing stream interface for sink
applets. It is a prerequisite for upcoming patches in order to cleanly
shut the applet during runtime without risking to send the RST packet
before all pending messages were sent to the endpoint.
Aborting the socket on soft-stop is not the same as aborting it due to
unexpected error. As such, let's leverage the granularity offered by
sedesc flags to better reflect the situation: abort during soft-stop is
handled as a soft close thanks to EOI+EOS flags, while abort due to
unexpected error is handled as hard error thanks to ERROR+EOS flags.
Thanks to this change, hard error will always emit RST packet even if
the NOLINGER option wasn't set on the socket.
There seem to be an ambiguity in the code where sft->appctx would differ
from the appctx that was assigned to it upon appctx creation.
In practise, it doesn't seem this could be happening. Adding a few notes
to come back to this later and try to see if we can remove this
ambiguity.
Now that sink_forward_oc_io_handler() and sink_forward_io_handler() were
unified again thanks to the previous commit, let's take a chance to merge
code that is common to both functions in order to ease code maintenance.
Let's add _sink_forward_io_handler() internal function which takes the
applet and a message handler as argument: sink_forward_io_handler() and
sink_forward_oc_io_handler() leverage this internal function by passing
the correct message handler for the desired format.
Re-apply dcd917d972 ("MINOR: applet: Remove uselelss test on SE_FL_SHR/SHW
flags") for sink_forward_oc_io_handler() function as it was probably
overlooked given that sink_forward_oc_io_handler() and
sink_forward_io_handler() follow the same logic.
In a739dc2 ("MEDIUM: sink: Use the sedesc to report and detect end of
processing"), we added a drain after close in sink_forward_oc_io_handler()
by the use of "goto out".
However, since we perform a close, there is no reason to drain data from
the socket. Moreover, before the patch there was no drain and nothing
mentioned the fact that that the drain was added on purpose. Lastly,
sink_forward_io_handler() and sink_forward_oc_io_handler() functions are
strictly identical when in comes to processing logic, and the drain was
only added in sink_forward_oc_io_handler() and not in
sink_forward_io_handler().
As such, it's pretty safe to assume that the drain is not needed here
and was added as accident. So in this patch we remove it in an attempt
to unify sink_forward_io_handler() and sink_forward_oc_io_handler()
functions like it was already the case before.
Since 09d69eacf8 ("MEDIUM: sink: start applets asynchronously") the applet
is no longer initialized under the sft lock while it was the case before.
At first it doesn't seem to be an issue, but if we look closer at
sink_forward_session_init(), we can see that sft->appctx is assigned
while it can be accessed at the same time from sink_init_forward().
Let's restore the old guarantees by performing the .init under the sft
lock.
No backport needed unless 09d69eacf8 is.
To be able to retrieve accurrate errors when a SPOP health-check is
performed, a customized tcp-check is used. Indeed, it is not possible to
rely on the SPOP multiplexer for now because the check is performed at the
mux connection layer and the error, if any, cannot be retrieved by the
health-check. A L4 success or error is reported.
To fix this issue and restore the previous behavior, a customized tcp-check
is created. The connection is forced to use the PT multiplexer. An hardcoded
message is sent and a customer handler is used to decode the SPOA response.
This way, it is possible to parse the response and return an accurrate
status code.
spoe_check_vsn() function can now be used to check if a version, converted
to an integer, via spoe_str_to_vsn() for instance, is supported. To do so,
the list of all supported version is now exported.
Add two initcall callback with BUG_ON_HOT() to newro and cubic modules to
ensure there is no buffer overflow when accessing the private data of
these congestion control algorithm state structures. This is to ensure
that further modifications about these data structures will not
lead to surprises. At this time there is no possible buffer overflow.
This bug arrived with this commit:
b068e758f MINOR: quic: simplify rescheduling for handshake
This commit introduced a bad side effect. Haproxy always replied by an ACK-only
datagram when it received the first client Initial packet. Then it handled
the CRYPTO data insided. And finally, it sent its own CRYPTO data. This broke
the packet coalescing rule whose aim is to optimally build and send as more
as QUIC packets by datagram.
To fix this, simply partially reverts this commit, to make the low level I/O
task return again if some CRYPTO were received. This will delay the
acknowledgement which will be sent with the CRYPTO data from the same
datagram again.
Must be backported to 3.0.
When a SPOE applet is created to send a message to an agent, the parent of
the associated stream is set to the one filtered. And the relationship
between the streams is removed when the applet is released or when the
processing on main stream is finished.
In the mean time, it is possible to get variables of the parent stream from
the SPOE one. It is not a huge change but this will be amazingly useful. For
instance, it is now possible to be sticky on a server using a critera of the
main streem. Here is an example using the client source address:
listen http
bind *:80
tcp-request content set-var(txn.client_src) src
filter spoe engine {SPOE-NAME} config /{SPOE-CONFIG}
http-request send-spoe-group {SPOE-NAME} {SPOE-MSG}
server www 127.0.0.1:8000
backend spoe-backend
mode spop
timeout server 10s
stick-table type ip size 200k expire 30m
stick on var(ptxn.client_src)
server srv1 ...
server srv2 ...
server srv3 ...
server srv4 ...
Of course, the feature is not limited to stick-tables. Everywhere variables
are used, it is now possible to get the value set on the parent stream from
the SPOE stream.
It is now possible to retrieved the value of a variable using the parent
stream or the parent session instead of the current one. It remains
forbidden to set or unset this value. The sample fetch used to store the
result is a local copy. So it may be safely altered by a converter without
changing the value of the original variable.
Note that for now, the parent of a stream is never set. So this part is not
really used. This will change with the SPOE.
Now a variable description is retrieved when a variable is parsed, we can
use it to get the variable value. It is mandatory to be able to know the
parent stream, if any, must be used, instead of the current one.
Add session/stream scopes related to the parent. To do so, "psess", "ptxn",
"preq" or "pres" must be used instead of tranditionnal scopes (without the
first "p"). the "proc" scope is not concerned by this change because it is
not linked to a stream. When such scopes are used, a specific flags is added
on the variable description during the variable parsing.
For now, theses scopes are parsed and the variable description is updated
accordingly. But at the end, any operation on the variable value fails.
Now a variable description is retrieved when a variable is parsed, we can
use it to set or unset the variable value. It is mandatory to be able to
know the parent stream, if any, must be used, instead of the current one.
A variable description is now used to parse a variable and extract its name
and its scope. It is mandatory to be able to add some flags on the variable
when it is evaluated (set or get). Among other things, this will be used to
know the parent stream, if any, must be used, instead of the current one.
A pointer to a parent stream was added in the stream structure. For now,
this pointer is never set, but the idea is to have an access to a stream
environment from another one from the moment there is a parent/child
relationship betwee these streams.
Concretely, for now, there is nothing to formalize this relationship.
The global request counter is used to set the stream id (s->uniq_id). It is
incremented at different places. And it must be atomically incremented
because it is a global value. However, in the analyer dealing with CLI
command response, this was not the case. It is now fixed.
This patch must be backported to all stable versions.
When a fallback IP address is provided in the list of methods to use to
resolve the server address, a warning is emitted if previous methods
failed. The aim is to inform this address will be used for the
server. However, it is valid use-case. It is the expected behavior. There is
no reason to emit a warning. Having a message during HAProxy startup to
inform the fallback IP address will be used is probably a good idea. But it
should be a notice not a warning. Otherwise, checking the configuration
validity will always failed, just like starting HAProxy in zero-warning
mode while the option was set on purpose.
This patch should fix the issue #2627. It must be backported to all stable
versions.
Since 2.5, an array of GPC is provided to replace legacy gpc0/gpc1.
src_inc_gpc is a sample fetch which is used to increment counters in
this array.
A crash occurs if src_inc_gpc is used without any previous track-sc
rule. This is caused by an error in smp_fetch_sc_inc_gpc(). When
temporary stick counter is created via smp_create_src_stkctr(), table
pointer arg value used is not correct : it points to the counter ID
instead of the table argument. To fix this, use the proper sample fetch
second arg.
This can be reproduced with the following config :
acl mark src_inc_gpc(0,<table>) -m bool
tcp-request connection accept if mark
This should be backported up to 2.6.
This commit continues to clean up cfg_parse_global() and to prepare the
refactoring of master-worker mode. Master, after forking a worker, enters in
its wait polling loop to catch signals and to provide master CLI. So, some
poller types could be disabled for master process it as well.
This commit cleans up cfg_parse_global() and prepares the config parser for
master-worker mode refactoring, where daemon and master-worker fork() calls
will happen very early in init().
So, the config in such case should be read twice:
- at first: only some keywords in the global section for the mode discovery
and everything, which is related to master process by opportunity;
- at second: except the master process, all other keywords would be parsed;
Fix build warning on NetBSD by reapplying f278eec37a ("BUILD: tree-wide:
cast arguments to tolower/toupper to unsigned char").
This should fix issue #2551.
Let's check the second time a global counter of "ha_warning" messages, if
zero-warning is set. And let's do this just before forking. At this moment we
are sure, that we've already done all init operations, where we could emit
"ha_warning", and we still have stderr fd opened.
Even with the second check, we could lost some late and rare warnings
about failing to drop supplementary groups and about re-enabling core dumps.
Notes about this are added into 'zero-warning' keyword description.
Since data->chain is now completed when loading the files, we don't need
to use ssl_get0_issuer_chain() anywhere else in the code.
data->chain will always be completed once the files are loaded, but we
can't know from show_cert_detail() from what chain file it was completed.
That's why the extra_chain pointer was added to dump the chain file.
This fixes OCSP, when issuer chain is in a separate PEM file. This is a
case of issuers-chain-path keyword, which points to folder that contains only
PEM with RootCA and IntermediateCA.
Before this patch, the chain from 'issuers-chain-path' was applied
directly to the SSL_CTX without being applied to the data->chain
structure. This would work for SSL traffic, but every tests done with
data->chain would fail, OCSP included, because the chain would be NULL.
This patch moves the loading of the chain from
ssl_sock_load_cert_chain(), which is the function that applies the chain
to the SSL_CTX, to ssl_sock_load_pem_into_ckch() which is the function
that loads the files into the ckch_data structure.
Fixes issue #2635 but it changes thing on the CLI, so that's not
backportable.
Most of the time all sink applets (which are responsible for relaying
messages from the ring to the tcp servers endpoints) would end up being
assigned to the first available thread (tid:0), resulting in excessive CPU
usage on a single thread when multiple sink servers were defined (no
matter if they were defined over multiple "ring" sections) and significant
message load was pushed through them over the ring API.
This patch is similar to 34e4085f ("MEDIUM: peers: Balance applets across
threads") but for sinks. We use a slightly different approach, which is to
elect a random thread instead of picking the one with leasts applets. This
proves to be already sufficient to alleviate the issue.
In the case we want to have a better load distribution we should consider
breaking existing connections to reestablish them on a new thread when we
find out that they start monopolizing a cpu thread (ie: after a certain
amount of messages for instance). Also check tcpchecks migrating model for
inspiration.
This patch depends on the previous one ("MEDIUM: sink: start applets
asynchronously").
Since d9c1d33fa1 ("MEDIUM: applet: Add support for async appctx startup
on a thread subset"), it is now possible to delay appctx's init: for that
it is required that the .init callback is defined on the applet.
When the applet will be processed on the first run, applet API will
automatically finish the applet initialization. Thus we explicitly
call appctx_wakeup() on the applet to schedule it for initial run
instead of calling appctx_init() ourselves.
This is done in prevision of the next patch in order to be able to
schedule the applet on a different thread from the one executing
sink_forward_session_create() function.
Note: 'out_free_appctx' label was removed since it is no longer used.
A risk of truncated packet was addressed in 2.9 by commit 19fb19976f
("BUG/MEDIUM: mux-h2: Only Report H2C error on read error if demux
buffer is empty") by ignoring CO_FL_ERROR after a recv() call as long
as some data remained present in the buffer. However it has a side
effect due to the fact that some frame processors only deal with full
frames, for example, HEADERS. The side effect is that an incomplete
frame will not be processed and will remain in the buffer, preventing
the error from being taken into account, so the I/O handler wakes up
the H2 parser to handle the error, and that one just subscribes for
more data, and this loops forever wasting CPU cycles.
Note that this only happens with errors at the SSL layer exclusively,
otherwise we'd have a read0 pending that would properly be detected:
conn->flags = CO_FL_XPRT_TRACKED | CO_FL_ERROR | CO_FL_XPRT_READY | CO_FL_CTRL_READY
conn->err_code = CO_ERR_SSL_FATAL
h2c->flags = H2_CF_ERR_PENDING | H2_CF_WINDOW_OPENED | H2_CF_MBUF_HAS_DATA | H2_CF_DEM_IN_PROGRESS | H2_CF_DEM_SHORT_READ
The condition to report the error in h2_recv() needs to be refined, so
that connection errors are taken into account either when the buffer is
empty, or when there's an incomplete frame, since we're certain it will
never be completed. We're certain to enter that function because
H2_CF_DEM_SHORT_READ implies too short a frame, and earlier there's a
protocol check to validate that no frame size is larger than bufsize,
hence a H2_CF_DEM_SHORT_READ implies there's some room left in the
buffer and we're allowed to try to receive.
The condition to reproduce the bug seems super hard to meet but was
observed once by Patrick Hemmer who had the reflex to capture lots of
information that allowed to explain the problem. In order to reproduce
it, the SSL code had to be significantly modified to alter received
contents at very empiric places, but that was sufficient to reproduce
it and confirm that the current patch works as expected.
The bug was tagged MAJOR because when it triggers there's no other
solution to get rid of it but to restart the process. However given how
hard it is to trigger on a lab, it does not seem very likely to occur
in field.
This needs to be backported to 2.9.
We could run under heavy load in containers or on premises and some automatic
tool in parallel could use CLI to check OCSP updates statuses or to upload new
OCSP responses. So, calloc() to store OCSP update callback arguments may fail
and ocsp_tree_lock need to be unlocked, when exiting due to this failure.
This needs to be backported in all stable versions until v2.4.0 included.
It's usefull to keep runtime limits (fd and RAM) in postmortem and show them in
debug_parse_cli_show_dev(). Runtime limits are fed in feed_post_mortem_late(),
as we are sure that at this moment that all configuration was parsed and
all applied limits were alredy adjusted.
This is a preparation patch to extend postmortem in order to store runtime
limits. No need to perform getrlimit() in feed_post_mortem(), as we do this
in the very beginning of main() and we store initial fd limits in global
'rlim_fd_cur_at_boot' and 'rlim_fd_max_at_boot' variables.
It is more handy to use LIM2A in debug_parse_cli_show_dev(), as it allows to
show a custom string ("unlimited"), if a given limit value equals to 0.
normalize_rlim() handler is needed to convert properly RLIM_INFINITY to zero,
with the respect of type sizes, as rlim_t is always 4 bytes on 32bit and
64bit arch.
Let's extend postmortem to keep process runtime capabilities. This information
is gathered in feed_post_mortem_late(), as it is called just before
run_poll_loop() and we are sure at this moment, that all configuration
settings were successfully applied.
Let's extend post_mortem to store runtime process uid and gid.
This information is fed in feed_post_mortem_late(), just before calling
run_poll_loop(). Like this we are sure that all configuration settings were
successfully applied.
Process runtime information could be very useful in post_mortem, but we have to
collect it just before calling run_poll_loop(). Like this we are sure, that
we've successfully applied all configuration parameters and what we've
collected are the latest runtime settings.
The most appropraite place to collect such information is
feed_post_mortem_late(). It's called in each thread, but puts thread info in
the post_mortem only when it's in the last thread context. As it's called
under mutex lock, other threads at this moment have to wait until
feed_post_mortem_late() and another initialization functions from
per_thread_init_list will finish. The number of threads could be large. So, to
avoid spending a lot of time under the lock, let's exit immediately from
feed_post_mortem_late(), if it wasn't called in the last thread.