There's no point taking the variables locks for sess/txn/req/res
contexts since these ones always run inside the same thread anyway.
This patch conditions the lock on the variable's scope to avoid
flushing cache lines when not needed.
This showed an improvement of ~5% on a 16-thread machine with 12
variables.
The global table of known variables names can only grow and was designed
for static names that are registered at boot. Nowadays it's possible to
set dynamic variable names from Lua or from the CLI, which causes a real
problem that was partially addressed in 2.2 with commit 4e172c93f
("MEDIUM: lua: Add `ifexist` parameter to `set_var`"). Please see github
issue #624 for more context.
This patch simplifies all this by removing the need for a central
registry of known names, and storing 64-bit hashes instead. This is
highly sufficient given the low number of variables in each context.
The hash is calculated using XXH64() which is bijective over the 64-bit
space thus is guaranteed collision-free for 1..8 chars. Above that the
risk remains around 1/2^64 per extra 8 chars so in practice this is
highly sufficient for our usage. A random seed is used at boot to seed
the hash so that it's not attackable from Lua for example.
There's one particular nit though. The "ifexist" hack mentioned above
is now limited to variables of scope "proc" only, and will only match
variables that were already created or declared, but will now verify
the scope as well. This may affect some bogus Lua scripts and SPOE
agents which used to accidentally work because a similarly named
variable used to exist in a different scope. These ones may need to be
fixed to comply with the doc.
Now we can sum up the situation as this one:
- ephemeral variables (scopes sess, txn, req, res) will always be
usable, regardless of any prior declaration. This effectively
addresses the most problematic change from the commit above that
in order to work well could have required some script auditing ;
- process-wide variables (scope proc) that are mentioned in the
configuration, referenced in a "register-var-names" SPOE directive,
or created via "set-var" in the global section or the CLI, are
permanent and will always accept to be set, with or without the
"ifexist" restriction (SPOE uses this internally as well).
- process-wide variables (scope proc) that are only created via a
set-var() tcp/http action, via Lua's set_var() calls, or via an
SPOE with the "force-set-var" directive), will not be permanent
but will always accept to be replaced once they are created, even
if "ifexist" is present
- process-wide variables (scope proc) that do not exist will only
support being created via the set-var() tcp/http action, Lua's
set_var() calls without "ifexist", or an SPOE declared with
"force-set-var".
This means that non-proc variables do not care about "ifexist" nor
prior declaration, and that using "ifexist" should most often be
reliable in Lua and that SPOE should most often work without any
prior declaration. It may be doable to turn "ifexist" to 1 by default
in Lua to further ease the transition. Note: regtests were adjusted.
Cc: Tim Düsterhus <tim@bastelstu.be>
We certainly do not want that a permanent variable (one that is listed
in the configuration) be erased by accident by an "unset-var" action.
Let's make sure these ones are only reset to an empty sample, like at
the moment of their initial registration. One trick is that the same
function is used to purge the memory at the end and to delete, so we
need to add an extra "force" argument to make the choice.
In order to continue to honor the ifexist Lua option and prevent rogue
SPOA agents from creating too many variables, we'll need to keep the
ability to mark certain proc.* variables as permanent when they're
known from the config file.
Let's add a flag there for this. It's added to the variable when the
variable is created with this flag set by the caller.
Another approach could have been to use a distinct list or distinct
scope but that sounds complicated and bug-prone.
Passing this flag to var_set() will result in the variable to only be
created if it did not exist, otherwise nothing is done (it's not even
updated). This will be used for pre-registering names.
When setting variables, there are currently two variants, one which will
always create the variable, and another one, "ifexist", which will only
create or update a variable if a similarly named variable in any scope
already existed before.
The goal was to limit the risk of injecting random names in the proc
scope, but it was achieved by making use of the somewhat limited name
indexing model, which explains the scope-agnostic restriction.
With this change, we're moving the check downwards in the chain, at the
variable level, and only variables under the scope "proc" will be subject
to the restriction. A new set of VF_* flags was added to adjust how
variables are set, and VF_UPDATEONLY is used to mention this restriction.
In this exact state of affairs, this is not completely exact, as if a
similar name was not known in any scope, the variable will continue to
be rejected like before, but this will change soon.
The vars_init() name is particularly confusing as it does not initialize
the variables code but the head of a list of variables passed in
arguments. And we'll soon need to have proper initialization code, so
let's rename it now.
In ticket #1348 some users expressed some concerns regarding the removal
of the "grace" directive from the proxies. Their use case very closely
mimmicks the original intent of the grace keyword, which is, let haproxy
accept traffic for some time when stopping, while indicating an external
LB that it's stopping.
This is implemented here by starting a task whose expiration triggers
the soft-stop for real. The global "stopping" variable is immediately
set however. For example, this below will be sufficient to instantly
notify an external check on port 9999 that the service is going down,
while other services remain active for 10s:
global
grace 10s
frontend ext-check
bind :9999
monitor-uri /ext-check
monitor fail if { stopping }
Ori Hollander of JFrog Security reported that htx_add_header() and
htx_add_trailer() were missing a length check on the header name. While
this does not allow to overwrite any memory area, it results in bits of
the header name length to slip into the header value length and may
result in forging certain header names on the input. The sad thing here
is that a FIXME comment was present suggesting to add the required length
checks :-(
The injected headers are visible to the HTTP internals and to the config
rules, so haproxy will generally stay synchronized with the server. But
there is one exception which is the content-length header field, because
it is already deduplicated on the input, but before being indexed. As
such, injecting a content-length header after the deduplication stage
may be abused to present a different, shorter one on the other side and
help build a request smuggling attack, or even maybe a response splitting
attack. CVE-2021-40346 was assigned to this problem.
As a mitigation measure, it is sufficient to verify that no more than
one such header is present in any message, which is normally the case
thanks to the duplicate checks:
http-request deny if { req.hdr_cnt(content-length) gt 1 }
http-response deny if { res.hdr_cnt(content-length) gt 1 }
This must be backported to all HTX-enabled versions, hence as far as 2.0.
In 2.3 and earlier, the functions are in src/htx.c instead.
Many thanks to Ori for his work and his responsible report!
Since commit "BUG/MINOR: config: reject configs using HTTP with bufsize
>= 256 MB" we are now sure that it's not possible anymore to have an HTX
block of a size 256 MB or more, even after concatenation thanks to the
tests for len >= htx_free_data_space(). Let's remove these now obsolete
comments.
A BUG_ON() was added in htx_add_blk() to track any such exception if
the conditions would change later, to complete the one that is performed
on the start address that must remain within the buffer.
In preparation for support default values when fetching variables, we
need to update the internal API to pass an extra argument to functions
vars_get_by_{name,desc} to provide an optional default value. This
patch does this and always passes NULL in this argument. var_to_smp()
was extended to fall back to this value when available.
The set-var() action is convenient because it preserves the input type
but it's a pain to deal with when trying to concatenate values. The
most recurring example is when it's needed to build a variable composed
of the source address and the source port. Usually it ends up like this:
tcp-request session set-var(sess.port) src_port
tcp-request session set-var(sess.addr) src,concat(":",sess.port)
This is even worse when trying to aggregate multiple fields from stick-table
data for example. Due to this a lot of users instead abuse headers from HTTP
rules:
http-request set-header(x-addr) %[src]:%[src_port]
But this requires some careful cleanups to make sure they won't leak, and
it's significantly more expensive to deal with. And generally speaking it's
not clean. Plus it must be performed for each and every request, which is
expensive for this common case of ip+port that doesn't change for the whole
session.
This patch addresses this limitation by implementing a new "set-var-fmt"
action which performs the same work as "set-var" but takes a format string
in argument instead of an expression. This way it becomes pretty simple to
just write:
tcp-request session set-var-fmt(sess.addr) %[src]:%[src_port]
It is usable in all rulesets that already support the "set-var" action.
It is not yet implemented for the global "set-var" directive (which already
takes a string) and the CLI's "set var" command, which would definitely
benefit from it but currently uses its own parser and engine, thus it
must be reworked.
The doc and regtests were updated.
For a long time we couldn't have arguments in expressions used in
tcp-request, tcp-response etc rules. But now due to the variables
it's possible, and their context in case of failure to resolve an
argument (e.g. backend name not found) is not properly reported
because there is no arg context values in ARGC_* to report them.
Let's add a number of missing ones for tcp-request {connection,
session,content}, tcp-response content, tcp-check, the config
parser (for "set-var" in the global section) and the CLI parser
(for "set-var" on the CLI).
Sometimes it is convenient to remap large sets of URIs to new ones (e.g.
after a site migration for example). This can be achieved using
"http-request redirect" combined with maps, but one difficulty there is
that non-matching entries will return an empty response. In order to
avoid this, duplicating the operation as an ACL condition ending in
"-m found" is possible but it becomes complex and error-prone while it's
known that an empty URL is not valid in a location header.
This patch addresses this by improving the redirect rules to be able to
simply ignore the rule and skip to the next one if the result of the
evaluation of the "location" expression is empty. However in order not
to break existing setups, it requires a new "ignore-empty" keyword.
There used to be an ACT_FLAG_FINAL on redirect rules that's used during
the parsing to emit a warning if followed by another rule, so here we
only set it if the option is not there. The http_apply_redirect_rule()
function now returns a 3rd value to mention that it did nothing and
that this was not an error, so that callers can just ignore the rule.
The regular "redirect" rules were not modified however since this does
not apply there.
The map_redirect VTC was completed with such a test and updated to 2.5
and an example was added into the documentation.
The locking in the dequeuing process was significantly improved by commit
49667c14b ("MEDIUM: queue: take the proxy lock only during the px queue
accesses") in that it tries hard to limit the time during which the
proxy's queue lock is held to the strict minimum. Unfortunately it's not
enough anymore, because we take up the task and manipulate a few pendconn
elements after releasing the proxy's lock (while we're under the server's
lock) but the task will not necessarily hold the server lock since it may
not have successfully found one (e.g. timeout in the backend queue). As
such, stream_free() calling pendconn_free() may release the pendconn
immediately after the proxy's lock is released while the other thread
currently proceeding with the dequeuing tries to wake up the owner's
task and dies in task_wakeup().
One solution consists in releasing le proxy's lock later. But tests have
shown that we'd have to sacrifice a significant share of the performance
gained with the patch above (roughly a 20% loss).
This patch takes another approach. It adds a "del_lock" to each pendconn
struct, that allows to keep it referenced while the proxy's lock is being
released. It's mostly a serialization lock like a refcount, just to maintain
the pendconn alive till the task_wakeup() call is complete. This way we can
continue to release the proxy's lock early while keeping this one. It had
to be added to the few points where we're about to free a pendconn, namely
in pendconn_dequeue() and pendconn_unlink(). This way we continue to
release the proxy's lock very early and there is no performance degradation.
This lock may only be held under the queue's lock to prevent lock
inversion.
No backport is needed since the patch above was merged in 2.5-dev only.
This option can be used to define a specific log format that will be
used in case of error, timeout, connection failure on a frontend... It
will be used for any log line concerned by the log-separate-errors
option. It will also replace the format of specific error messages
decribed in section 8.2.6.
If no "error-log-format" is defined, the legacy error messages are still
emitted and the other error logs keep using the regular log-format.
Other build warnings were emitted on LIBRESSL_VERSION_NUMBER with -Wundef
under openssl < 1.1. Related to GH issue #1369. Seems like some of them
could be simplified a little bit.
Openssl-compat emits a warning for the test on LIBRESSL_VERSION that might
be underfined, if built with -Wundef. The fix is easy, let's do it. Related
to GH issue #1369.
As reported in GH issue #1369, there is a single case of #if with a
possibly undefined value in defaults.h which is on MAXHOSTNAMELEN. Let's
turn it to a #ifdef.
The code used to rely on BITS_PER_LONG to decide on the most efficient
way to perform a 64-bit shift, but this macro is not defined (at best
it's __BITS_PER_LONG) and it's likely that it's been like this since
the early implementation of ebtrees designed on i386. Let's remove the
test on this macro and rely on sizeof(long) instead, it also has the
benefit of letting the compiler validate the two branches.
This can be backported to all versions. Thanks to Ezequiel Garcia for
reporting this one in issue #1369.
Before threads were introduced in 1.8, idle_pct used to be a global
variable indicating the overall process idle time. Threads made it
thread-local, meaning that its reporting in the stats made little
sense, though this was not easy to spot. In 2.0, the idle_pct variable
moved to the struct thread_info via commit 81036f273 ("MINOR: time:
move the cpu, mono, and idle time to thread_info"). It made it more
obvious that the idle_pct was per thread, and also allowed to more
accurately measure it. But no more effort was made in that direction.
This patch introduces a new report_idle() function that accurately
averages the per-thread idle time over all running threads (i.e. it
should remain valid even if some threads are paused or stopped), and
makes use of it in the stats / "show info" reports.
Sending traffic over only two connections of an 8-thread process
would previously show this erratic CPU usage pattern:
$ while :; do socat /tmp/sock1 - <<< "show info"|grep ^Idle;sleep 0.1;done
Idle_pct: 30
Idle_pct: 35
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 35
Idle_pct: 33
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Idle_pct: 100
Now it shows this more accurate measurement:
$ while :; do socat /tmp/sock1 - <<< "show info"|grep ^Idle;sleep 0.1;done
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
Idle_pct: 83
This is not technically a bug but this lack of precision definitely affects
some users who rely on the idle_pct measurement. This should at least be
backported to 2.4, and might be to some older releases depending on users
demand.
In 2.4 we extended the max poll time from 1s to 60s with commit
4f59d3861 ("MINOR: time: increase the minimum wakeup interval to 60s").
This had the consequence that the calculation of the idle time percentage
may overflow during the multiply by 100 if the thread had slept 43s or
more. Let's change this to a 64 bit computation. This will have no
performance impact since this is done at most twice per second.
This should fix github issue #1366.
This must be backported to 2.4.
To be able to provide JA3 compatible TLS Fingerprints we need to expose
all Client Hello captured data using fetchers. Patch provides new
and modifies existing fetchers to add ability to filter out GREASE values:
- ssl_fc_cipherlist_*
- ssl_fc_ecformats_bin
- ssl_fc_eclist_bin
- ssl_fc_extlist_bin
- ssl_fc_protocol_hello_id
When we set tune.ssl.capture-cipherlist-size to a non-zero value
we are able to capture cipherlist supported by the client. To be able to
provide JA3 compatible TLS fingerprinting we need to capture more
information from Client Hello message:
- SSL Version
- SSL Extensions
- Elliptic Curves
- Elliptic Curve Point Formats
This patch allows HAProxy to capture such information and store it for
later use.
There are regularly places, especially in config analysis, where we
need to report certain things (warnings or errors) only once, but
where implementing a counter is sufficiently deterrent so that it's
not done.
Let's add a simple ONLY_ONCE() macro that implements a static variable
(char) which is atomically turned on, and returns true if it's set for
the first time. This uses fairly compact code, a single byte of BSS
and is thread-safe. There are probably a number of places in the config
parser where this could be used. It may also be used to implement a
WARN_ON() similar to BUG_ON() but which would only warn once.
Define a flag to mark a server as non purgeable. This flag will be used
for "delete server" CLI handler. All servers without this flag will be
eligible to runtime suppression.
In a future patch, it will be possible to remove at runtime every
servers, both static and dynamic. This requires to extend the server
refcount for all instances.
First, refcount manipulation functions have been renamed to better
express the API usage.
* srv_refcount_use -> srv_take
The refcount is always initialize to 1 on the server creation in
new_server. It's also incremented for each check/agent configured on a
server instance.
* free_server -> srv_drop
This decrements the refcount and if null, the server is freed, so code
calling it must not use the server reference after it. As a bonus, this
function now returns the next server instance. This is useful when
calling on the server loop without having to save the next pointer
before each invocation.
In these functions, remove the checks that prevent refcount on
non-dynamic servers. Each reference to "dynamic" in variable/function
naming have been eliminated as well.
A dynamic server may be deleted at runtime at the same moment when the
stats applet is pointing to it. Use the server refcount to prevent
deletion in this case.
This should be backported up to 2.4, with an observability period of 2
weeks. Note that it requires the dynamic server refcounting feature
which has been implemented on 2.5; the following commits are required :
- MINOR: server: implement a refcount for dynamic servers
- BUG/MINOR: server: do not use refcount in free_server in stopping mode
- MINOR: server: return the next srv instance on free_server
As a convenience, return the next server instance from servers list on
free_server.
This is particularily useful when using this function on the servers
list without having to save of the next pointer before calling it.
Implements a way of checking the running openssl version:
If the OpenSSL support was not compiled within HAProxy it will returns a
error, so it's recommanded to do a SSL feature check before:
$ ./haproxy -cc 'feature(OPENSSL) && openssl_version_atleast(0.9.8zh) && openssl_version_before(3.0.0)'
This will allow to select the SSL reg-tests more carefully.
Include the correct .h files in http_client.c and http_client.h.
The api.h is needed in http_client.c and http_client-t.h is now include
directly from http_client.h
The X509_STORE_CTX_get0_cert did not exist yet on OpenSSL 1.0.2 and
neither did X509_STORE_CTX_get0_chain, which was not actually needed
since its get1 equivalent already existed.
Most of the SSL sample fetches related to the client certificate were
based on the SSL_get_peer_certificate function which returns NULL when
the verification process failed. This made it impossible to use those
fetches in a log format since they would always be empty.
The patch adds a reference to the X509 object representing the client
certificate in the SSL structure and makes use of this reference in the
fetches.
The reference can only be obtained in ssl_sock_bind_verifycbk which
means that in case of an SSL error occurring before the verification
process ("no shared cipher" for instance, which happens while processing
the Client Hello), we won't ever start the verification process and it
will be impossible to get information about the client certificate.
This patch also allows most of the ssl_c_XXX fetches to return a usable
value in case of connection failure (because of a verification error for
instance) by making the "conn->flags & CO_FL_WAIT_XPRT" test (which
requires a connection to be established) less strict.
Thanks to this patch, a log-format such as the following should return
usable information in case of an error occurring during the verification
process :
log-format "DN=%{+Q}[ssl_c_s_dn] serial=%[ssl_c_serial,hex] \
hash=%[ssl_c_sha1,hex]"
It should answer to GitHub issue #693.
This commit implements a very simple HTTP Client API.
A client can be operated by several functions:
- httpclient_new(), httpclient_destroy(): create
and destroy the struct httpclient instance.
- httpclient_req_gen(): generate a complete HTX request using the
the absolute URL, the method and a list of headers. This request
is complete and sets the HTX End of Message flag. This is limited
to small request we don't need a body.
- httpclient_start() fill a sockaddr storage with a IP extracted
from the URL (it cannot resolve an fqdm for now), start the
applet. It also stores the ptr of the caller which could be an
appctx or something else.
- hc->ops contains a list of callbacks used by the
HTTPClient, they should be filled manually after an
httpclient_new():
* res_stline(): the client received a start line, its content
will be stored in hc->res.vsn, hc->res.status, hc->res.reason
* res_headers(): the client received headers, they are stored in
hc->res.hdrs.
* res_payload(): the client received some payload data, they are
stored in the hc->res.buf buffer and could be extracted with the
httpclient_res_xfer() function, which takes a destination buffer
as a parameter
* res_end(): this callback is called once we finished to receive
the response.
While http_parse_scheme() extracts a scheme from a URI by extracting
exactly the valid characters and stopping on delimiters, this new
function performs the same on a fixed-size string.
This new class exposes methods to manipulate HTTP messages from a filter
written in lua. Like for the HTTP class, there is a bunch of methods to
manipulate the message headers. But there are also methods to manipulate the
message payload. This part is similar to what is available in the Channel
class. Thus the payload can be duplicated, erased, modified or
forwarded. For now, only DATA blocks can be retrieved and modified because
the current API is limited. No HTTPMessage method is able to yield. Those
manipulating the headers are always called on messages containing all the
headers, so there is no reason to yield. Those manipulating the payload are
called from the http_payload filters callback function where yielding is
forbidden.
When an HTTPMessage object is instantiated, the underlying Channel object
can be retrieved via the ".channel" field.
For now this class is not used because the HTTP filtering is not supported
yet. It will be the purpose of another commit.
There is no documentation for now.
A lua TXN can be created when a sample fetch, an action or a filter callback
function is executed. A flag is now used to track the execute context.
Respectively, HLUA_TXN_SMP_CTX, HLUA_TXN_ACT_CTX and HLUA_TXN_FLT_CTX. The
filter flag is not used for now.
When a script is executed, a flag is used to allow it to yield. An error is
returned if a lua function yield, explicitly or not. But there is no way to
get this capability in C functions. So there is no way to choose to yield or
not depending on this capability.
To fill this gap, the flag HLUA_NOYIELD is introduced and added on the lua
context if the current script execution is not authorized to yield. Macros
to set, clear and test this flags are also added.
This feature will be usefull to fix some bugs in lua actions execution.
Implement a mechanism to free a started check on runtime for dynamic
servers. A new function check_purge is created for this. The check task
will be marked for deletion and scheduled to properly close connection
elements and free the task/tasklet/buf_wait elements.
This function will be useful to delete a dynamic server wich checks.
It is necessary to have a refcount mechanism on dynamic servers to be
able to enable check support. Indeed, when deleting a dynamic server
with check activated, the check will be asynchronously removed. This is
mandatory to properly free the check resources in a thread-safe manner.
The server instance must be kept alive for this.
Remove static qualifier on init_srv_check, init_srv_agent_check and
start_check_task. These functions will be called in server.c for dynamic
servers with checks.
Implement an equivalent of task_kill for tasklets. This function can be
used to request a tasklet deletion in a thread-safe way.
Currently this function is unused.
There was no way to access the SPOE filter configuration from the agent
object. However it could be handy to have it. And in fact, this will be
required to fix a bug.
Right now we're using a DWCAS to atomically set the running_mask while
being constrained by the thread_mask. This DWCAS is annoying because we
may seriously need it later when adding support for thread groups, for
checking that the running_mask applies to the correct group.
It turns out that the DWCAS is not strictly necessary because we never
need it to set the thread_mask based on the running_mask, only the other
way around. And in fact, the running_mask is always cleared alone, and
the thread_mask is changed alone as well. The running_mask is only
relevant to indicate a takeover when the thread_mask matches it. Any
bit set in running and not present in thread_mask indicates a transition
in progress.
As such, it is possible to re-arrange this by using a regular CAS around a
consistency check between running_mask and thread_mask in fd_update_events
and by making a CAS on running_mask then an atomic store on the thread_mask
in fd_takeover(). The only other case is fd_delete() but that one already
sets the running_mask before clearing the thread_mask, which is compatible
with the consistency check above.
This change has happily survived 10 billion takeovers on a 16-thread
machine at 800k requests/s.
The fd-migration doc was updated to reflect this change.
This one is set whenever an FD is reported by a poller with a null owner,
regardless of the thread_mask. It has become totally meaningless because
it only indicates a migrated FD that was not yet reassigned to a thread,
but as soon as a thread uses it, the status will change to skip_fd. Thus
there is no reason to distinguish between the two, it adds more confusion
than it helps. Let's simply drop it.
The current principle of running under isolation was made to access
sensitive data while being certain that no other thread was using them
in parallel, without necessarily having to place locks everywhere. The
main use case are "show sess" and "show fd" which run over long chains
of pointers.
The thread_isolate() call relies on the "harmless" bit that indicates
for a given thread that it's not currently doing such sensitive things,
which is advertised using thread_harmless_now() and which ends usings
thread_harmless_end(), which also waits for possibly concurrent threads
to complete their work if they took this opportunity for starting
something tricky.
As some system calls were notoriously slow (e.g. mmap()), a bunch of
thread_harmless_now() / thread_harmless_end() were placed around them
to let waiting threads do their work while such other threads were not
able to modify memory contents.
But this is not sufficient for performing memory modifications. One such
example is the server deletion code. By modifying memory, it not only
requires that other threads are not playing with it, but are not either
in the process of touching it. The fact that a pool_alloc() or pool_free()
on some structure may call thread_harmless_now() and let another thread
start to release the same object's memory is not acceptable.
This patch introduces the concept of "idle threads". Threads entering
the polling loop are idle, as well as those that are waiting for all
others to become idle via the new function thread_isolate_full(). Once
thread_isolate_full() is granted, the thread is not idle anymore, and
it is released using thread_release() just like regular isolation. Its
users have to keep in mind that across this call nothing is granted as
another thread might have performed shared memory modifications. But
such users are extremely rare and are actually expecting this from their
peers as well.
Note that that in case of backport, this patch depends on previous patch:
MINOR: threads: make thread_release() not wait for other ones to complete
This patch splits the disabled state of a proxy into a PR_DISABLED and a
PR_STOPPED state.
The first one is set when the proxy is disabled in the configuration
file, and the second one is set upon a stop_proxy().
Rename the 'dontloglegacyconnerr' option to 'log-error-via-logformat'
which is much more self-explanatory and readable.
Note: only legacy keywords don't use hyphens, it is recommended to
separate words with them in new keywords.
The x86-tso model makes the load and store barriers unneeded for our
usage as long as they perform at least a compiler barrier: the CPU
will respect store ordering and store vs load ordering. It's thus
safe to remove the lfence and sfence which are normally needed only
to communicate with external devices. Let's keep the mfence though,
to make sure that reads of same memory location after writes report
the value from memory and not the one snooped from the write buffer
for too long.
An in-depth review of all use cases tends to indicate that this is
okay in the rest of the code. Some parts could be cleaned up to
use atomic stores and atomic loads instead of explicit barriers
though.
Doing this reliably increases the overall performance by about 2-2.5%
on a 8c-16t Xeon thanks to less frequent flushes (it's likely that the
biggest gain is in the MT lists which use them a lot, and that this
results in less cache line flushes).
The atomic_load/atomic_store/atomic_xchg operations were all forced to
__ATOMIC_SEQ_CST, which results in explicit store or even full barriers
even on x86-tso while we do not need them: we're not communicating with
external devices for example and are only interested in respecting the
proper ordering of loads and stores between each other.
These ones being rarely used, the emitted code on x86 remains almost the
same (barring a handful of locations). However they will allow to place
correct barriers at other places where atomics are accessed a bit lightly.
The patch is marked medium because we can never rule out the risk of some
bugs on more relaxed platforms due to the rest of the code.
update_freq_ctr_period() was using relaxed atomics without using barriers,
which usually works fine on x86 but not everywhere else. In addition, some
values were read without being enclosed by barriers, allowing the compiler
to possibly prefetch them a bit earlier. Finally, freq_ctr_total() was also
reading these without enough barriers. Let's make explicit use of atomic
loads and atomic stores to get rid of this situation. This required to
slightly rearrange the freq_ctr_total() loop, which could possibly slightly
improve performance under extreme contention by avoiding to reread all
fields.
A backport may be done to 2.4 if a problem is encountered, but last tests
on arm64 with LSE didn't show any issue so this can possibly stay as-is.
This function already performs a number of checks prior to calling the
IOCB, and detects the change of thread (FD migration). Half of the
controls are still in each poller, and these pollers also maintain
activity counters for various cases.
Note that the unreliable test on thread_mask was removed so that only
the one performed by fd_set_running() is now used, since this one is
reliable.
Let's centralize all that fd-specific logic into the function and make
it return a status among:
FD_UPDT_DONE, // update done, nothing else to be done
FD_UPDT_DEAD, // FD was already dead, ignore it
FD_UPDT_CLOSED, // FD was closed
FD_UPDT_MIGRATED, // FD was migrated, ignore it now
Some pollers already used to call it last and have nothing to do after
it, regardless of the result. epoll has to delete the FD in case a
migration is detected. Overall this removes more code than it adds.
Since 2.4 with commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") we can merge all flag updates at
once in fd_update_events(). Previously this was performed in 1 to 3 steps,
setting the polling state, then setting READY_R if in/err/hup, and setting
READY_W if out/err. But since the commit above, all flags are stored
together in the same structure field that is being updated with the new
flags, thus we can simply update the flags altogether and avoid multiple
atomic operations. This even removes the need for atomic ops for FDs that
are not shared.
There's a theoretical race (that we failed to trigger) in function
fd_update_events(), which could strike on idle connections. The "locked"
variable will most often be 0 as the FD is bound to the current thread
only. Another thread could take it over once "locked" is set, change
the thread and running masks. Then the first thread updates the FD's
state non-atomically and possibly overwrites what the other thread was
preparing. It still looks like the FD's state will ultimately converge
though.
The solution against this is to set the running flag earlier so that a
takeover() attempt cannot succeed, or that the fd_set_running() attempt
fails, indicating that nothing needs to be done on this FD.
While this is sufficient for a simple fix to be backported, it leaves
the FD actively polled in the calling thread, this will trigger a second
wakeup which will notice the absence of tid_bit in the thread_mask,
getting rid of it.
A more elaborate solution would consist in calling fd_set_running()
directly from the pollers before calling fd_update_events(), getting
rid of the thread_mask test and letting the caller eliminate that FD
from its list if needed.
Interestingly, this code also proves to be suboptimal in that it sets
the FD state twice instead of calculating the new state at once and
always using a CAS to set it. This is a leftover of a simplification
that went into 2.4 and which should be explored in a future patch.
This may be backported as far as 2.2.
The takeover of idle conns between threads is particularly tricky, for
two reasons:
- there's no way to atomically synchronize kernel-side polling with
userspace activity, so late events will always be reported for some
FDs just migrated ;
- upon error, an FD may be immediately reassigned to whatever other
thread since it's process-wide.
The current model uses the FD's thread_mask to figure if an FD still
ought to be reported or not, and a per-thread idle connection queue
from which eligible connections are atomically added/picked. I/Os
coming from the bottom for such a connection must remove it from the
list so that it's not elected. Same for timeout tasks and iocbs. And
these last ones check their context under the idle_conn lock to judge
if they're still allowed to run.
One rare case was omitted: the wake() callback. This one is rare, it
may serve to notify about finalized connect() calls that are not being
polled, as well as unhandled shutdowns and errors. This callback was
not protected till now because it wasn't seen as sensitive, but there
exists a particular case where it may be called without protectoin in
parallel to a takeover. This happens in the following sequence:
- thread T1 wants to establish an outgoing connection
- the connect() call returns EINPROGRESS
- the poller adds it using epoll_ctl()
- epoll_wait() reports it, connect() is done. The connection is
not being marked as actively polled anymore but is still known
from the poller.
- the request is sent over that connection using send(), which
queues to system buffers while data are being delivered
- the scheduler switches to other tasks
- the request is physically sent
- the server responds
- the stream is notified that send() succeeded, and makes progress,
trying to recv() from that connection
- the recv() succeeds, the response is delivered
- the poller doesn't need to be touched (still no active polling)
- the scheduler switches to other tasks
- the server closes the connection
- the poller on T1 is notified of the SHUTR and starts to call mux->wake()
- another thread T2 takes over the connection
- T2 continues to run inside wake() and releases the connection
- T2 is just dereferencing it.
- BAM.
The most logical solution here is to surround the call to wake() with an
atomic removal/insert of the connection from/into the idle conns lists.
This way, wake() is guaranteed to run alone. Any other poller reporting
the FD will not have its tid_bit in the thread_mask si will not bother
it. Another thread trying a takeover will not find this connection. A
task or tasklet being woken up late will either be on the same thread,
or be called on another one with a NULL context since it will be the
consequence of previous successful takeover, and will be ignored. Note
that the extra cost of a lock and tree access here have a low overhead
which is totally amortized given that these ones roughly happen 1-2
times per connection at best.
While it was possible to crash the process after 10-100k req using H2
and a hand-refined configuration achieving perfect synchronism between
a long (20+) chain of proxies and a short timeout (1ms), now with that
fix this never happens even after 10M requests.
Many thanks to Olivier for proposing this solution and explaining why
it works.
This should be backported as far as 2.2 (when inter-thread takeover was
introduced). The code in older versions will be found in conn_fd_handler().
A workaround consists in disabling inter-thread pool sharing using:
tune.idle-pool.shared off
In case of connection failure, a dedicated error message is output,
following the format described in section "Error log format" of the
documentation. These messages cannot be configured through a log-format
option.
This patch adds a new option, "dontloglegacyconnerr", that disables
those error logs when set, and "replaces" them by a regular log line
that follows the configured log-format (thanks to a call to sess_log in
session_kill_embryonic).
The new fc_conn_err sample fetch allows to add the legacy error log
information into a regular log format.
This new option is unset by default so the logging logic will remain the
same until this new option is used.
This new sample fetch along the ssl_fc_hsk_err_str fetch contain the
last SSL error of the error stack that occurred during the SSL
handshake (from the frontend's perspective). The errors happening during
the client's certificate verification will still be given by the
ssl_c_err and ssl_c_ca_err fetches. This new fetch will only hold errors
retrieved by the OpenSSL ERR_get_error function.
The fc_conn_err and fc_conn_err_str sample fetches give information
about the problem that made the connection fail. This information would
previously only have been given by the error log messages meaning that
thanks to these fetches, the error log can now be included in a custom
log format. The log strings were all found in the conn_err_code_str
function.
The CO_ER_SSL_EARLY_FAILED and CO_ER_CIP_TIMEOUT connection error codes
were missing in the conn_err_code_str switch which converts the error
codes into string.
This patch can be backported on all stable branches.
This patch renames the proxy capability "LUA" to "INT" so it could be
used for any internal proxy.
Every proxy that are not user defined should use this flag.
Oss-fuzz reports in issue 36328 that we can recurse too far by passing
extremely deep expressions to the ".if" parser. I thought we were still
limited to the 1024 chars per line, that would be highly sufficient, but
we don't have any limit now :-/
Let's just pass a maximum recursion counter to the recursive parsers.
It's decremented for each call and the expression fails if it reaches
zero. On the most complex paths it can add 3 levels per parenthesis,
so with a limit of 1024, that's roughly 343 nested sub-expressions that
are supported in the worst case. That's more than sufficient, for just
a few kB of RAM.
No backport is needed.
This option had always been broken in HTX, which means that the first
breakage appeared in 1.9, that it was broken by default in 2.0 and that
no workaround existed starting with 2.1. The way this option works is
praticularly unfit to the rest of the configuration and to the internal
architecture. It had some uses when it was introduced 14 years ago but
nowadays it's possible to do much better and more reliable using a
set of "http-request set-dst" and "http-request set-uri" rules, which
additionally are compatible with DNS resolution (via do-resolve) and
are not exclusive to normal load balancing. The "option-http_proxy"
example config file was updated to reflect this.
The option is still parsed so that an error message gives hints about
what to look for.
The cfg_free_cond_{term,and,expr}() functions used to take a pointer to
the pointer to be freed in order to replace it with a NULL once done.
But this doesn't cope well with freeing lists as it would require
recursion which the current code tried to avoid.
Let's just change the API to free the area and let the caller set the NULL.
This leak was reported by oss-fuzz (issue 36265).
Now it's possible to form a term using parenthesis around an expression.
This will soon allow to build more complex expressions. For now they're
still pretty limited but parenthesis do work.
Now evaluating a condition will rely on an expression (or an empty string),
and this expression will support ORing a sub-expression with another
optional expression. The sub-expressions ANDs a term with another optional
sub-expression. With this alone precedence between && and || is respected,
and the following expression:
A && B && C || D || E && F || G
will naturally evaluate as:
(A && B && C) || D || (E && F) || G
It's not convenient to let the caller be responsible for node allocation,
better have the leaf function do that and implement the accompanying free
call. Now only a pointer is needed instead of a struct, and the leaf
function makes sure to leave the situation in a consistent way.
The purpose is to build a descendent parser that will split conditions
into expressions made of terms. There are two phases, a parsing phase
and an evaluation phase. Strictly speaking it's not required to cut
that in two right now, but it's likely that in the future we won't want
certain predicates to be evaluated during the parsing (e.g. file system
checks or execution of some external commands).
The cfg_eval_condition() function is now much simpler, it just tries to
parse a single term, and if OK evaluates it, then returns the result.
Errors are unchanged and may still be reported during parsing or
evaluation.
It's worth noting that some invalid expressions such as streq(a,b)zzz
continue to parse correctly for now (what remains after the parenthesis
is simply ignored as not necessary).
The .if/.else/.endif and condition evaluation code is quite dirty and
was dumped into cfgparse.c because it was easy. But it should be tidied
quite a bit as it will need to evolve.
Let's move all that to cfgcond.{c,h}.
make_arg_list() can create an array of arguments, some of which remain
to be resolved, but all users had to deal with their own roll back on
error. Let's add a free_args() function to release all the array's
elements and let the caller deal with the array itself (sometimes it's
allocated in the stack).
Since 1.9 with commit 673867c35 ("MAJOR: applets: Use tasks, instead
of rolling our own scheduler.") the thread_mask field of the appctx
became unused, but the code hadn't been cleaned for this. The appctx
has its own task and the task's thread_mask is the one to be displayed.
It's worth noting that all calls to appctx_new() pass tid_bit as the
thread_mask. This makes sense, and it could be convenient to decide
that this becomes the norm and to simplify the API.
Define a new global config statement named
"h2-workaround-bogus-websocket-clients".
This statement will disable the automatic announce of h2 websocket
support as specified in the RFC8441. This can be use to overcome clients
which fail to implement the relatively fresh RFC8441. Clients will in
his case automatically downgrade to http/1.1 for the websocket tunnel
if the haproxy configuration allows it.
This feature is relatively simple and can be backported up to 2.4, which
saw the introduction of h2 websocket support.
Replace http_get_path by the http_uri_parser API. The new functions is
renamed http_parse_path. Replace duplicated code for scheme and
authority parsing by invocations to http_parse_scheme/authority.
If no scheme is found for an URI detected as an absolute-uri/authority,
consider it to be an authority format : no path will be found. For an
absolute-uri or absolute-path, use the remaining of the string as the
path. A new http_uri_parser state is declared to mark the path parsing
as done.
Replace http_get_authority by the http_uri_parser API.
The new function is renamed http_parse_authority. Replace duplicated
scheme parsing code by http_parse_scheme invocation. A new
http_uri_parser state is declared to mark the authority parsing as done.
Replace http_get_scheme by the http_uri_parser API. The new function is
renamed http_parse_scheme. A new http_uri_parser state is declared to
mark the scheme parsing as completed.
Implement a http uri parser type. This type will be used as a context to
parse the various elements of an uri.
The goal of this serie of patches is to factorize duplicated code
between the http_get_scheme/authority/path functions. A simple parsing
API is designed to be able to extract once each element of an HTTP URI
in order. The functions will be renamed in the following patches to
reflect the API change with the prefix http_parse_*.
For the parser API, the http_uri_parser type must first be
initialized before usage. It will register the URI to parse and detect
its format according to the rfc 7230.
Implement the scheme-based uri normalization as described in rfc3986
6.3.2. Its purpose is to remove the port of an uri if the default one is
used according to the uri scheme : 80/http and 443/https. All other
ports are not touched.
This method uses an htx message as an input. It requires that the target
URI is in absolute-form with a http/https scheme. This represents most
of h2 requests except CONNECT. On the contrary, most of h1 requests
won't be elligible as origin-form is the standard case.
The normalization is first applied on the target URL of the start line.
Then, it is conducted on every Host headers present, assuming that they
are equivalent to the target URL.
This change will be notably useful to not confuse users who are
accustomed to use the host for routing without specifying default ports.
This problem was recently encountered with Firefox which specify the 443
default port for http2 websocket Extended CONNECT.
This patch adds the definition of two new array data_types:
'gpc': This is an array of 32bits General Purpose Counters.
'gpc_rate': This is an array on increment rates of General Purpose Counters.
Like for all arrays, they are limited to 100 elements.
This patch also adds actions and fetches to handle
elements of those arrays.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpc0', 'gpc1', 'gpc0_rate' nor 'gpc1_rate'.
This patch adds the definition of a new array data_type
'gpt'. This is an array of 32bits General Purpose Tags.
Like for all arrays, it is limited to 100 elements.
This patch also adds actions and fetches to handle
elements of this array.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpt0' data type.
This patch adds support of array data_types on the peer protocol.
The table definition message will provide an additionnal parameter
for array data-types: the number of elements of the array.
In case of array of frqp it also provides a second parameter:
the period used to compute freq counter.
The array elements are std_type values linearly encoded in
the update message.
Note: if a remote peer announces an array data_type without
parameters into the table definition message, all updates
on this table will be ignored because we can not
parse update messages consistently.
This patch provides the code to handle arrays of some
standard types (SINT, UINT, ULL and FRQP) in stick table.
This way we could define new "array" data types.
Note: the number of elements of an array was limited
to 100 to put a limit and to ensure that an encoded
update message will continue to fit into a buffer
when the peer protocol will handle such data types.
This patch replaces all advanced data type aliases on
stktable_data_cast calls by standard types.
This way we could call the same stktable_data_cast
regardless of the used advanced data type as long they
are using the same std type.
It also removes all the advanced data type aliases.
It is now possible to set the Netfilter MARK and the TOS field value in all
packets sent to the client from any tcp-request rulesets or the "tcp-response
content" one. To do so, the parsing of "set-mark" and "set-tos" actions are
moved in tcp_act.c and the actions evaluation is handled in dedicated functions.
This patch may be backported as far as 2.2 if necessary.
It is now possible to set the "nice" factor of the current stream from a
"tcp-request content" or "tcp-response content" ruleset. To do so, the
action parsing is moved in stream.c and the action evaluation is handled in
a dedicated function.
This patch may be backported as far as 2.2 if necessary.
It is now possible to set the stream log level from a "tcp-request content"
or "tcp-response content" ruleset. To do so, the action parsing is moved in
stream.c and the action evaluation is handled in a dedicated function.
This patch should fix issue #1306. It may be backported as far as 2.2 if
necessary.
Now we directly use p->queue to get to the queue, which is much more
straightforward. The performance on 100 servers and 16 threads
increased from 560k to 574k RPS, or 2.5%.
A lot more simplifications are possible, but the minimum was done at
this point.
A queue is specific to a server or a proxy, so we don't need to place
this distinction inside all pendconns, it can be in the queue itself.
This commit adds the relevant fields "px" and "sv" into the struct
queue, and initializes them accordingly.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 362k
to 374k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 9%.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
This reverts commit fcb8bf8650.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit c83e45e9b0.
The recent changes since 5304669e1 MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 491k
to 507k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 6%.
The performance profile changes from this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
to this:
11.95% haproxy [.] fwlc_srv_reposition
7.57% haproxy [.] fwlc_get_next_server
3.51% haproxy [.] process_srv_queue
1.74% haproxy [.] pendconn_dequeue
1.70% haproxy [.] pendconn_add
At this point the differences are mostly measurement noise.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with nor without locking, so this should
remain invisible.
This structure will be common to proxies and servers and will contain
everything needed to handle their respective queues. For now it's only
a tree head, a length and an index.
This essentially reverts commit 2b4370078 ("MINOR: lb/api: let callers
of take_conn/drop_conn tell if they have the lock") that was merged
during 2.4 before the various locks could be eliminated at the lower
layers. Passing that information complicates the cleanup of the queuing
code and it's become useless.
The server_parse_maxconn_change_request locks the server lock. However,
this function can be called via agent-checks or lua code which already
lock it. This bug has been introduced by the following commit :
commit 79a88ba3d0
BUG/MAJOR: server: prevent deadlock when using 'set maxconn server'
This commit tried to fix another deadlock with can occur because
previoulsy server_parse_maxconn_change_request requires the server lock
to be held. However, it may call internally process_srv_queue which also
locks the server lock. The locking policy has thus been updated. The fix
is functional for the CLI 'set maxconn' but fails to address the
agent-check / lua counterparts.
This new issue is fixed in two steps :
- changes from the above commit have been reverted. This means that
server_parse_maxconn_change_request must again be called with the
server lock.
- to counter the deadlock fixed by the above commit, process_srv_queue
now takes an argument to render the server locking optional if the
caller already held it. This is only used by
server_parse_maxconn_change_request.
The above commit was subject to backport up to 1.8. Thus this commit
must be backported in every release where it is already present.
GitHub uses github/linguist to determine the programming language used for each
source file to show statistics and to power the search. In cases of unique file
extensions this is easy, but for `.h` files the situation is less clear as they
are used for C, C++, Objective C and more. In these cases linguist makes use of
heuristics to determine the language.
One of these heuristics for C++ is that the file contains a line beginning with
`try`, only preceded by whitespace indentation. This heuristic matches the long
comment at the bottom of `channel-t.h`, as one sentence includes the word `try`
after a linebreak.
Fix this misdetection by changing the comment to follow the convention that all
lines start with an asterisk.
The function ssl_sock_load_srv_cert will be used at runtime for dynamic
servers. If the cert is not loaded on ckch tree, we try to access it
from the file-system.
Now this access operation is rendered optional by a new function
argument. It is only allowed at parsing time, but will be disabled for
dynamic servers at runtime.
Explicitly call ssl_initialize_random to initialize the random generator
in init() global function. If the initialization fails, the startup is
interrupted.
This commit is in preparation for support of ssl on dynamic servers. To
be able to activate ssl on dynamic servers, it is necessary to ensure
that the random generator is initialized on startup regardless of the
config. It cannot be called at runtime as access to /dev/urandom is
required.
This also has the effect to fix the previous non-consistent behavior.
Indeed, if bind or server in the config are using ssl, the
initialization function was called, and if it failed, the startup was
interrupted. Otherwise, the ssl initialization code could have been
called through the ssl server for lua, but this times without blocking
the startup on error. Or not called at all if lua was deactivated.
The SF_SRV_REUSED flag was set if a stream reused a backend connection.
One of its purpose is to count the total reuse on the backend in
opposition to newly instantiated connection.
However, the flag was diverted from its original purpose since the
following commit :
e8f5f5d8b2
BUG/MEDIUM: servers: Only set SF_SRV_REUSED if the connection if fully ready.
With this change, the flag is not set anymore if the mux is not ready
when a connection is picked for reuse. This can happen for multiplexed
connections which are inserted in the available list as soon as created
in http-reuse always mode. The goal of this change is to not retry
immediately this request in case on an error on the same server if the
reused connection is not fully ready.
This change is justified for the retry timeout handling but it breaks
other places which still uses the flag for its original purpose. Mainly,
in this case the wrong 'connect' backend counter is incremented instead
of the 'reuse' one. The flag is also used in http_return_srv_error and
may have an impact if a http server error is replied for this stream.
To fix this problem, the original purpose of the flag is restored by
setting it unconditionaly when a connection is reused. Additionally, a
new flag SF_SRV_REUSED_ANTICIPATED is created. This flag is set when the
connection is reused but the mux is not ready yet. For the timeout
handling on error, the request is retried immediately only if the stream
reused a connection without this newly anticipated flag.
This must be backported up to 2.1.
When a server relies on a SRV resolution, a task is created to clean it up
(fqdn/port and address) when the SRV resolution is considered as outdated
(based on the resolvers 'timeout' value). It is only possible if the server
inherits outdated info from a state file and is no longer selected to be
attached to a SRV item. Note that most of time, a server is attached to a
SRV item. Thus when the item becomes obsolete, the server is cleaned
up.
It is important to have such task to be sure the server will be free again
to have a chance to be resolved again with fresh information. Of course,
this patch is a workaround to solve a design issue. But there is no other
obvious way to fix it without rewritting all the resolvers part. And it must
be backportable.
This patch relies on following commits:
* MINOR: resolvers: Clean server in a dedicated function when removing a SRV item
* MINOR: resolvers: Remove server from named_servers tree when removing a SRV item
All the series must be backported as far as 2.2 after some observation
period. Backports to 2.0 and 1.8 must be evaluated.
To avoid repeating the same source code, allocating memory and initializing
the per_thr field from the server structure is transferred to a separate
function.
This function appends to a buffer some information from a connection.
This will be used by traces and possibly some debugging as well. A
frontend/backend/server, transport/control layers, source/destination
ip:port, connection pointer and direction are reported depending on
the available information.
With a single process, we don't need to USE_PRIVATE_CACHE, USE_FUTEX
nor USE_PTHREAD_PSHARED anymore. Let's only keep the basic spinlock
to lock between threads.
The relative_pid is always 1. In mworker mode we also have a
child->relative_pid which is always equalt relative_pid, except for a
master (0) or external process (-1), but these types are usually tested
for, except for one place that was amended to carefully check for the
PROC_O_TYPE_WORKER option.
Changes were pretty limited as most usages of relative_pid were for
designating a process in stats output and peers protocol.
Lots of places iterating over nbproc or comparing with nbproc could be
simplified. Further, "bind-process" and "process" parsing that was
already limited to process 1 or "all" or "odd" resulted in a bind_proc
field that was either 0 or 1 during the init phase and later always 1.
All the checks for compatibilities were removed since it's not possible
anymore to run a frontend and a backend on different processes or to
have peers and stick-tables bound on different ones. This is the largest
part of this patch.
The bind_proc field was removed from both the proxy and the receiver
structs.
Since the "process" and "bind-process" directives are still parsed,
configs making use of correct values allowing process 1 will continue
to work.
This is a leftover of a previous attempt that was introduced in 2.4 by
commit d3a88c1c3 ("MEDIUM: connection: close front idling connection on
soft-stop"). It can be backported, as the variable doesn't exist.
Since threads were introduced in 1.8, the USE_PRIVATE_CACHE mode of the
shctx was not updated to use locks. Originally it was meant to disable
sharing between processes, so it removes the lock/unlock instructions.
But with threads enabled, it's not possible to work like this anymore.
It's easy to see that once built with private cache and threads enabled,
sending violent SSL traffic to the the process instantly makes it die.
The HTTP cache is very likely affected as well.
This patch addresses this by falling back to our native spinlocks when
USE_PRIVATE_CACHE is used. In practice we could use them also for other
modes and remove all older implementations, but this patch aims at keeping
the changes very low and easy to backport. A new SHCTX_LOCK label was
added to help with debugging, but OTHER_LOCK might be usable as well
for backports.
An even lighter approach for backports may consist in always declaring
the lock (or reusing "waiters"), and calling pl_take_s() for the lock()
and pl_drop_s() for the unlock() operation. This could even be used in
all modes (process and threads), even when thread support is disabled.
Subsequent patches will further clean up this area.
This patch must be backported to all supported versions since 1.8.
This one was deprecated in 2.3 and marked for removal in 2.5. It suffers
too many limitations compared to threads, and prevents some improvements
from being engaged. Instead of a bypassable startup error, there is now
a hard error.
The parsing code was removed, and very few obvious cases were as well.
The code is deeply rooted at certain places (e.g. "for" loops iterating
from 0 to nbproc) so it will not be that trivial to remove everywhere.
The "bind" and "bind-process" parsers will have to be adjusted, though
maybe not completely changed if we later want to support thread groups
for large NUMA machines. Some stats socket restrictions were removed,
and the doc was updated according to what was done. A few places in the
doc still refer to nbproc and will have to be revisited. The master-worker
code also refers to the process number to distinguish between master and
workers and will have to be carefully adjusted. The MAX_PROCS macro was
reset to 1, this will at least reduce the size of some remaining arrays.
Two regtests were dependieng on this directive, one with an explicit
"nbproc 1" and another one testing the master's CLI using nbproc 4.
Both were adapted.
Commit ab0a5192a ("MEDIUM: config: mark "grace" as deprecated") marked
the "grace" keyword as deprecated in 2.3, tentative removal for 2.4
with a hard deadline in 2.5, so let's remove it and return an error now.
This old and outdated feature was incompatible with soft-stop, reload
and socket transfers, and keeping it forced ugly hacks in the lower
layers of the protocol stack.
This patch add a ref into servers to register them onto the
record answer item used to set their hostnames.
It also adds a head list into 'srvrq' to register servers free
to be affected to a SRV record.
A head of a tree is also added to srvrq to put servers which
present a hotname in server state file. To re-link them fastly
to the matching record as soon an item present the same name.
This results in better performances on SRV record response
parsing.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
This patch adds a head list into answer items on servers which use
this record to set their IPs. It makes lookup on duplicated ip faster and
allow to check immediatly if an item is still valid renewing the IP.
This results in better performances on A/AAAA resolutions.
This is an optimization but it could avoid to trigger the haproxy's
internal wathdog in some circumstances. And for this reason
it should be backported as far we can (2.0 ?)
When an HTX block is expanded, a defragmentation may be performed first to
have enough space to copy the new data. When it happens, the meta data of
the HTX message must take account of the new data length but copied data are
still unchanged at this stage (because we need more space to update the
message content). And here there is a bug because the meta data are updated
by the caller. It means that when the blocks content is copied, the new
length is already set. Thus a block larger than the reality is copied and
data outside the buffer may be accessed, leading to a crash.
To fix this bug, htx_defrag() is updated to use an extra argument with the
new meta data to use for the referenced block. Thus the caller does not need
to update the HTX message by itself. However, it still have to update the
data.
Most of time, the bug will be encountered in the HTTP compression
filter. But, even if it is highly unlikely, in theory it is also possible to
hit it when a HTTP header (or only its value) is replaced or when the
start-line is changed.
This patch must be backported as far as 2.0.
A tiny change in commit 6af81f80f ("MEDIUM: errors: implement parsing
context type") triggered an awful bug in gcc 5 and below (4.7.4 to 5.5
confirmed affected, at least on aarch64/mips/x86_64) causing the startup
to loop forever in acl_find_target().
This was tracked down to the acl.c file seeing a different definition
of the struct proxy than other files. The reason for this is that it
sees an unpacked "enum obj_type" (4 bytes) while others see it packed
(1 byte), thus all fields in the struct are having a different
alignment, and the "acl" list is shifted one pointer to the next struct
and seems to loop onto itself.
The commit above did nothing more than adding "enum obj_type *obj" in a
new struct without including obj_type.h, and that was apparently enough
for the compiler to internally declare obj_type as a regular enum and
silently ignore the packed attribute that it discovers later, so depending
on the order of includes, some files would see it as 1 byte and others as
4.
This patch simply adds the missing include but due to the nature of the
bug, probably that creating a special "packed_enum" definition to disable
the packed attribute on such compilers could be a safer option.
No backport is needed as this is only in -dev.
The ifdefs surrounding the "show ssl ocsp-response" functionality that
were supposed to disable the code with BoringSSL were built the wrong
way.
It does not need to be backported.
Now that the modified lockless variant does not need a DWCAS anymore,
there's no reason to keep the much slower locked version, so let's
just get rid of it.
In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed
report showing reproducible crashes under musl. Musl is one of the libs
coming with a simple allocator for which we prefer to keep the shared
cache. On x86 we have a DWCAS so the lockless implementation is enabled
for such libraries.
And this implementation has had a small race since day one: the allocator
will need to read the first object's <next> pointer to place it into the
free list's head. If another thread picks the same element and immediately
releases it, while both the local and the shared pools are too crowded, it
will be freed to the OS. If the libc's allocator immediately releases it,
the memory area is unmapped and we can have a crash while trying to read
that pointer. However there is no problem as long as the item remains
mapped in memory because whatever value found there will not be placed
into the head since the counter will have changed.
The probability for this to happen is extremely low, but as analyzed by
Fabiano, it increases with the buffer size. On 16 threads it's relatively
easy to reproduce with 2MB buffers above 200k req/s, where it should
happen within the first 20 seconds of traffic usually.
This is a structural issue for which there are two non-trivial solutions:
- place a read lock in the alloc call and a barrier made of lock/unlock
in the free() call to force to serialize operations; this will have
a big performance impact since free() is already one of the contention
points;
- change the allocator to use a self-locked head, similar to what is
done in the MT_LISTS. This requires two memory writes to the head
instead of a single one, thus the overhead is exactly one memory
write during alloc and one during free;
This patch implements the second option. A new POOL_DUMMY pointer was
defined for the locked pointer value, allowing to both read and lock it
with a single xchg call. The code was carefully optimized so that the
locked period remains the shortest possible and that bus writes are
avoided as much as possible whenever the lock is held.
Tests show that while a bit slower than the original lockless
implementation on large buffers (2MB), it's 2.6 times faster than both
the no-cache and the locked implementation on such large buffers, and
remains as fast or faster than the all implementations when buffers are
48k or higher. Tests were also run on arm64 with similar results.
Note that this code is not used on modern libcs featuring a fast allocator.
A nice benefit of this change is that since it removes a dependency on
the DWCAS, it will be possible to remove the locked implementation and
replace it with this one, that is then usable on all systems, thus
significantly increasing their performance with large buffers.
Given that lockless pools were introduced in 1.9 (not supported anymore),
this patch will have to be backported as far as 2.0. The code changed
several times in this area and is subject to many ifdefs which will
complicate the backport. What is important is to remove all the DWCAS
code from the shared cache alloc/free lockless code and replace it with
this one. The pool_flush() code is basically the same code as the
allocator, retrieving the whole list at once. If in doubt regarding what
barriers to use in older versions, it's safe to use the generic ones.
This patch depends on the following previous commits:
- MINOR: pools: do not maintain the lock during pool_flush()
- MINOR: pools: call malloc_trim() under thread isolation
- MEDIUM: pools: use a single pool_gc() function for locked and lockless
The last one also removes one occurrence of an unneeded DWCAS in the
code that was incompatible with this fix. The removal of the now unused
seq field will happen in a future patch.
Many thanks to Fabiano for his detailed report, and to Olivier for
his help on this issue.
Since the code was reorganized, DEBUG_UAF was still tested in the locked
pool code despite pools being disabled when DEBUG_UAF is used. Let's move
the test to pool_put_to_os() which is the one that is always called in
this condition.
The impact is only a possible misleading analysis during a troubleshooting
session due to a missing double-frees or free of const area test that is
normally already dealt with by the underlying code anyway. In practice it's
unlikely anyone will ever notice.
This should only be backported to 2.4.
This patch adds the "show ssl ocsp-response [<id>]" CLI command. This
command can be used to display the IDs of the OCSP tree entries along
with details about the entries' certificate ID (issuer's name and key
hash + serial number), or to display the details of a single
ocsp-response if an ID is given. The details displayed in this latter
case are the ones shown by a "openssl ocsp -respin <ocsp-response>
-text" call.
Since commit 04a5a44 ("BUILD: ssl: use HAVE_OPENSSL_KEYLOG instead of
OpenSSL versions") the "tune.ssl.keylog" feature is broken because
HAVE_OPENSSL_KEYLOG does not exist.
Replace this by a HAVE_SSL_KEYLOG which is defined in openssl-compat.h.
Also add an error when not built with the right openssl version.
Must be backported as far as 2.3.
Change the algorithm for the generation of the user messages context
prefix. Remove the dubious API relying on optional printf positional
arguments. This may be non portable, and in fact the CI glibc crashes
with the following error when some arguments are not present in the
format string :
"invalid %N$ use detected".
Now, a fixed buffer attached to the context instance is allocated once
for the program lifetime. Then call repeatedly snprintf with the
optional arguments of context if present to build the context string.
The buffer is deallocated via a per-thread free handler.
This does not need to be backported.
This patch adds the `-cc` (check condition) argument to evaluate conditions on
startup and return the result as the exit code.
As an example this can be used to easily check HAProxy's version in scripts:
haproxy -cc 'version_atleast(2.4)'
This resolves GitHub issue #1246.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
Create a parsing_ctx structure. This type is used to store information
about the current file/line parsed. A global context is created and
can be manipulated when haproxy is in STARTING mode. When starting is
over, the context is resetted and should not be accessed anymore.
The user messages buffer is used to store the stderr output after the
starting is over. Each thread has it own user messages buffer. Add some
functions to add a new message, retrieve and clear the content.
The user messages buffer primary goal is to be consulted by CLI
handlers. Each handlers using it must clear the buffer before starting
its operation.
Move functions related to errors output on stderr from log.c to a newly
created errors.c file. It targets print_message and
ha_alert/warning/notice/diag functions and related startup_logs feature.
A memory allocation failure happening in chash_init_server_tree while
trying to allocate a server's lb_nodes item used in consistent hashing
would have resulted in a crash. This function is only called during
configuration parsing.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
A memory allocation failure happening in mworker_env_to_proc_list when
trying to allocate a mworker_proc would have resulted in a crash. This
function is only called during init.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
A memory allocation failure happening during peers_register_table would
have resulted in a crash. This function is only called during init.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
Two calloc calls were not checked in the srv_parse_source function.
Considering that this function could be called at runtime through a
dynamic server creation via the CLI, this could lead to an unfortunate
crash.
It was raised in GitHub issue #1233.
It could be backported to all stable branches even though the runtime
crash could only happen on branches where dynamic server creation is
possible.
b_slow_realign() function may be used to realign a buffer with a given
amount of output data, eventually 0. In such case, the head is set to
0. This function is not designed to be used with input only buffers, like
those used in the muxes. It is the purpose of b_slow_realign_ofs()
function. It does almost the same, realign a buffer. But it do so by setting
the buffer head to a specific offset.
h1 parsing functions (h1_parse_msg_*) returns the number of bytes parsed or
0 if nothing is parsed because an error occurred or some data are
missing. But they never return negative values. Thus, instead of a signed
integer, these function now return a size_t value.
The H1 and FCGI muxes are updated accordingly. Note that h1_parse_msg_data()
has been slightly adapted because the parsing of chunked messages still need
to handle negative values when a parsing error is reported by
h1_parse_chunk_size() or h1_skip_chunk_crlf().
The output of "show map/acl" now contains the 'entry_cnt' value that
represents the count of all the entries for each map/acl, not just the
active ones, which means that it also includes entries currently being
added.
The first item inserted into an ebtree will be inserted directly below
the root, which is a simple struct eb_root which only holds two branch
pointers (left and right).
If we try to find a duplicated entry to this first leaf through a
ebmb_next_dup, our leaf_p pointer will point to the eb_root instead of a
complete eb_node so we cannot look for the bit part of our leaf_p since
it would try to cast our eb_root into an eb_node and perform an out of
bounds access when reading "eb_root_to_node(eb_untag(t,EB_LEFT)))->bit".
This bug was found by address sanitizer running on a CRL hot update VTC
test.
Note that the bug has been there since the import of the eb_next_dup()
and eb_prev_dup() function in 1.5-dev19 by commit 2b5702030 ("MINOR:
ebtree: add new eb_next_dup/eb_prev_dup() functions to visit duplicates").
It can be backported to all stable branches.
The following functions used in CA/CRL file hot update were not defined
in OpenSSL 1.0.2 so they need to be defined in openssl-compat :
- X509_CRL_get_signature_nid
- X509_CRL_get0_lastUpdate
- X509_CRL_get0_nextUpdate
- X509_REVOKED_get0_serialNumber
- X509_REVOKED_get0_revocationDate
This patch adds the "set ssl crl-file" and "commit ssl crl-file"
commands, following the same logic as the certificate and CA file update
equivalents.
When trying to update a Certificate Revocation List (CRL) file via a
"set" command, we start by looking for the entry in the CA file tree and
then building a new cafile_entry out of the payload, without adding it
to the tree yet. It will only be added when a "commit" command is
called.
During a "commit" command, we insert the newly built cafile_entry in the
CA file tree while keeping the previous entry. We then iterate over all
the instances that used the CRL file and rebuild a new one and its
dedicated SSL context for every one of them.
When all the contexts are properly created, the old instances get
replaced by the new ones and the old CRL file is removed from the tree.
The CA files and CRL files are stored in the same cafile_tree so this
patch adds a new field the the cafile_entry structure that specifies the
type of the entry. Since a ca-file can also have some CRL sections, the
type will be based on the option used to load the file and not on its
content (ca-file vs crl-file options).
This patch adds the "set ssl ca-file" and "commit ssl ca-file" commands,
following the same logic as the certificate update equivalents.
When trying to update a ca-file entry via a "set" command, we start by
looking for the entry in the cafile_tree and then building a new
cafile_entry out of the given payload. This new object is not added to
the cafile_tree until "commit" is called.
During a "commit" command, we insert the newly built cafile_entry in the
cafile_tree, while keeping the previous entry as well. We then iterate
over all the instances linked in the old cafile_entry and rebuild a new
ckch instance for every one of them. The newly inserted cafile_entry is
used for all those new instances and their respective SSL contexts.
When all the contexts are properly created, the old instances get
replaced by the new ones and the old cafile_entry is removed from the
tree.
This fixes a subpart of GitHub issue #1057.
Adds a way to insert a new uncommitted cafile_entry in the tree. This
entry will be the one fetched by any lookup in the tree unless the
oldest cafile_entry is explicitely looked for. This way, until a "commit
ssl ca-file" command is completed, there could be two cafile_entries
with the same path in the tree, the original one and the newly updated
one.
The updated CA content coming from the CLI during a ca-file update will
directly be in memory and not on disk so the way CAs are loaded in a
cafile_entry for now (via X509_STORE_load_locations calls) cannot be
used.
This patch adds a way to fill a cafile_entry directly from memory and to
load the contained certificate and CRL sections into an SSL store.
CRL sections are managed as well as certificates in order to mimic the
way CA files are processed when specified in an option. Indeed, when
parsing a CA file given through a ca-file or ca-verify-file option, we
iterate over the different sections in ssl_set_cert_crl_file and load
them regardless of their type. This ensures that a file that was
properly parsed when given as an option will also be accepted by the
CLI.
In order for the link between the cafile_entry and the default ckch
instance to be built, we need to give a pointer to the instance during
the ssl_sock_prepare_ctx call.
Each ca-file entry of the tree will now hold a list of the ckch
instances that use it so that we can iterate over them when updating the
ca-file via a cli command. Since the link between the SSL contexts and
the CA file tree entries is only built during the ssl_sock_prepare_ctx
function, which are called after all the ckch instances are created, we
need to add a little post processing after each ssl_sock_prepare_ctx
that builds the link between the corresponding ckch instance and CA file
tree entries.
In order to manage the ca-file and ca-verify-file options, any ckch
instance can be linked to multiple CA file tree entries and any CA file
entry can link multiple ckch instances. This is done thanks to a
dedicated list of ckch_inst references stored in the CA file tree
entries over which we can iterate (during an update for instance). We
avoid having one of those instances go stale by keeping a list of
references to those references in the instances.
When deleting a ckch_inst, we can then remove all the ckch_inst_link
instances that reference it, and when deleting a cafile_entry, we
iterate over the list of ckch_inst reference and clear the corresponding
entry in their own list of ckch_inst_link references.
This patch moves all the ssl_store related code to ssl_ckch.c since it
will mostly be used there once the CA file update CLI commands are all
implemented. It also makes the cafile_entry structure visible as well as
the cafile_tree.
stdint.h is not as portable as inttypes.h. It doesn't exist at least
on AIX 5.1 and Solaris 7, while inttypes.h is present there and does
include stdint.h on platforms supporting it.
This is equivalent to libslz upstream commit e36710a ("slz: use
inttypes.h instead of stdint.h")
On ARM with native CRC support, no need to inflate the executable with
a 4kB CRC table, let's just drop it.
This is slz upstream commit d8715db20b2968d1f3012a734021c0978758f911.
This function was not used anymore after the atomic updates were
implemented in 2.3, and it must not be used given that it does not
yield and can easily make the process hang for tens of seconds on
large acls/maps. Let's remove it before someone uses it as an
example to implement something else!
It's still very difficult to find all commands starting with a given
keyword like "set", "show" etc. Let's sort the lines by usage message,
this is much more convenient.
The build fails on versions older than 1.0.1d which is the first one
introducing CRYPTO_memcmp(), so let's have a define for this instead
of enabling it whenever USE_OPENSSL is set. One could also wonder why
we're relying on openssl for such a trivial thing, and a simple local
implementation could also allow to restore lexicographic ordering.
Some of the Lua doc and a few places still used "Haproxy" or "HAproxy".
There was even one "HA proxy". A few of them were in an example of VTest
output, indicating that VTest ought to be fixed as well. No big deal but
better address all the remaining ones so that these inconsistencies stop
spreading around.
listener-t comes with openssl just due to the SSL_CTX type that is
declred as a typedef in openssl hence cannot be abstracted at this
level. However connection-t.h doen't need all that just to know that
bind_conf is a struct. Let's declare it with other external types
instead..
These ones are used by virtually every config parser. Not only they
provide no benefit in being inlined, but they imply a very deep
dependency starting at proxy.h, which results for example in task.c
including openssl.
Let's move these two functions to cfgparse.c.
These caddr_* functions were once placed into tools.h in the hope they
would be useful but nobody knows they exist. They could deserve being
moved to their own file with other pointer manipulation functions maybe,
but for now they're the only reason left for stick_table.h to include
tools.h, so let's move them directly there since it's its only user.
This allows to remove tools.h from stick_table.h and slightly reduce
the overall build time.
This function has no business being inlined in stick_table.h since it's
only used at boot time by the config parser. In addition it causes an
undesired dependency on tools.h because it uses parse_time_err(). Let's
move it to stick_table.c.
No idea why this was put inlined into connection.h, it's used only once
for haproxy -vv, and requires tools.h, causing an undesired dependency
from connection.h. Let's move it to connection.c instead where it ought
to have been.
Only mworker uses proc_self, and it was declared in global.h, forcing
users of global.h to include mworker and its dependencies.
Moving it to mworker reduces the preprocessed size of version.c from
170 to 125kB by shrinking the number of local includes from 30 to 16
and the number of system includes from 147 to 132.
The presence of this field causes a long dependency chain because almost
everyone includes global-t.h, and vars include sample_data which include
some system includes as well as HTTP parts.
There is absolutely no reason for having the process-wide variables in
the global struct, let's just move them into vars.c and vars.h. This
reduces from ~190k to ~170k the preprocessed output of version.c.
This will allow some fields to be produced with a higher accuracy when
the requester indicates being able to parse floats. Rates and times are
among the elements which can make sense.
Currently the stats filling function knows nothing about the caller's
needs, so let's pass the STAT_* flags so that it can adapt to the
requester's constraints.
For stats reporting it can be convenient to report floats at low rates
instead of discrete integers. We do have quite some precision since we
currently divide counters by number of milliseconds, so we can usually
add 3 digits after the decimal point.