Implement an equivalent of task_kill for tasklets. This function can be
used to request a tasklet deletion in a thread-safe way.
Currently this function is unused.
There was no way to access the SPOE filter configuration from the agent
object. However it could be handy to have it. And in fact, this will be
required to fix a bug.
Right now we're using a DWCAS to atomically set the running_mask while
being constrained by the thread_mask. This DWCAS is annoying because we
may seriously need it later when adding support for thread groups, for
checking that the running_mask applies to the correct group.
It turns out that the DWCAS is not strictly necessary because we never
need it to set the thread_mask based on the running_mask, only the other
way around. And in fact, the running_mask is always cleared alone, and
the thread_mask is changed alone as well. The running_mask is only
relevant to indicate a takeover when the thread_mask matches it. Any
bit set in running and not present in thread_mask indicates a transition
in progress.
As such, it is possible to re-arrange this by using a regular CAS around a
consistency check between running_mask and thread_mask in fd_update_events
and by making a CAS on running_mask then an atomic store on the thread_mask
in fd_takeover(). The only other case is fd_delete() but that one already
sets the running_mask before clearing the thread_mask, which is compatible
with the consistency check above.
This change has happily survived 10 billion takeovers on a 16-thread
machine at 800k requests/s.
The fd-migration doc was updated to reflect this change.
This one is set whenever an FD is reported by a poller with a null owner,
regardless of the thread_mask. It has become totally meaningless because
it only indicates a migrated FD that was not yet reassigned to a thread,
but as soon as a thread uses it, the status will change to skip_fd. Thus
there is no reason to distinguish between the two, it adds more confusion
than it helps. Let's simply drop it.
The current principle of running under isolation was made to access
sensitive data while being certain that no other thread was using them
in parallel, without necessarily having to place locks everywhere. The
main use cases are "show sess" and "show fd" which run over long chains
of pointers.
The thread_isolate() call relies on the "harmless" bit that indicates
for a given thread that it's not currently doing such sensitive things.
This bit is advertised using thread_harmless_now() and cleared with
thread_harmless_end(), which also waits for possibly concurrent threads
to complete their work if they took this opportunity to start
something tricky.
As some system calls were notoriously slow (e.g. mmap()), a bunch of
thread_harmless_now() / thread_harmless_end() were placed around them
to let waiting threads do their work while such other threads were not
able to modify memory contents.
But this is not sufficient for performing memory modifications. One such
example is the server deletion code. By modifying memory, it not only
requires that other threads are not playing with it, but are not either
in the process of touching it. The fact that a pool_alloc() or pool_free()
on some structure may call thread_harmless_now() and let another thread
start to release the same object's memory is not acceptable.
This patch introduces the concept of "idle threads". Threads entering
the polling loop are idle, as well as those that are waiting for all
others to become idle via the new function thread_isolate_full(). Once
thread_isolate_full() is granted, the thread is not idle anymore, and
it is released using thread_release() just like regular isolation. Its
users have to keep in mind that across this call nothing is granted as
another thread might have performed shared memory modifications. But
such users are extremely rare and are actually expecting this from their
peers as well.
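A minimal usage sketch of the new isolation level (function names are
those introduced by this patch):

    thread_isolate_full();   /* wait for all other threads to be idle, not just harmless */
    /* ... perform shared memory modifications, e.g. delete a server ... */
    thread_release();        /* let the other threads resume */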
Note that in case of backport, this patch depends on previous patch:
MINOR: threads: make thread_release() not wait for other ones to complete
This patch splits the disabled state of a proxy into a PR_DISABLED and a
PR_STOPPED state.
The first one is set when the proxy is disabled in the configuration
file, and the second one is set upon a stop_proxy().
Rename the 'dontloglegacyconnerr' option to 'log-error-via-logformat'
which is much more self-explanatory and readable.
Note: only legacy keywords don't use hyphens; it is recommended to
separate words with hyphens in new keywords.
The x86-tso model makes the load and store barriers unneeded for our
usage as long as they perform at least a compiler barrier: the CPU
will respect store ordering and store vs load ordering. It's thus
safe to remove the lfence and sfence which are normally needed only
to communicate with external devices. Let's keep the mfence though,
to make sure that reads of same memory location after writes report
the value from memory and not the one snooped from the write buffer
for too long.
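A hedged sketch of what this boils down to on x86-64 (the macro names
here are purely illustrative):

    #define barrier_load()  __asm__ volatile("" ::: "memory")        /* compiler barrier only */
    #define barrier_store() __asm__ volatile("" ::: "memory")        /* stores already ordered by x86-tso */
    #define barrier_full()  __asm__ volatile("mfence" ::: "memory")  /* still a real fence */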
An in-depth review of all use cases tends to indicate that this is
okay in the rest of the code. Some parts could be cleaned up to
use atomic stores and atomic loads instead of explicit barriers
though.
Doing this reliably increases the overall performance by about 2-2.5%
on a 8c-16t Xeon thanks to less frequent flushes (it's likely that the
biggest gain is in the MT lists which use them a lot, and that this
results in less cache line flushes).
The atomic_load/atomic_store/atomic_xchg operations were all forced to
__ATOMIC_SEQ_CST, which results in explicit store or even full barriers
even on x86-tso while we do not need them: we're not communicating with
external devices for example and are only interested in respecting the
proper ordering of loads and stores between each other.
These ones being rarely used, the emitted code on x86 remains almost the
same (barring a handful of locations). However they will allow to place
correct barriers at other places where atomics are accessed a bit lightly.
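In terms of compiler builtins, the change roughly amounts to the
following (the exact memory orders retained are an assumption here):

    /* before: everything was sequentially consistent */
    v = __atomic_load_n(&shared, __ATOMIC_SEQ_CST);
    __atomic_store_n(&shared, v, __ATOMIC_SEQ_CST);

    /* after: acquire/release is enough to order loads and stores between threads */
    v = __atomic_load_n(&shared, __ATOMIC_ACQUIRE);
    __atomic_store_n(&shared, v, __ATOMIC_RELEASE);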
The patch is marked medium because we can never rule out the risk of some
bugs on more relaxed platforms due to the rest of the code.
update_freq_ctr_period() was using relaxed atomics without using barriers,
which usually works fine on x86 but not everywhere else. In addition, some
values were read without being enclosed by barriers, allowing the compiler
to possibly prefetch them a bit earlier. Finally, freq_ctr_total() was also
reading these without enough barriers. Let's make explicit use of atomic
loads and atomic stores to get rid of this situation. This required to
slightly rearrange the freq_ctr_total() loop, which could possibly slightly
improve performance under extreme contention by avoiding to reread all
fields.
A backport may be done to 2.4 if a problem is encountered, but last tests
on arm64 with LSE didn't show any issue so this can possibly stay as-is.
This function already performs a number of checks prior to calling the
IOCB, and detects the change of thread (FD migration). Half of the
controls are still in each poller, and these pollers also maintain
activity counters for various cases.
Note that the unreliable test on thread_mask was removed so that only
the one performed by fd_set_running() is now used, since this one is
reliable.
Let's centralize all that fd-specific logic into the function and make
it return a status among:
FD_UPDT_DONE, // update done, nothing else to be done
FD_UPDT_DEAD, // FD was already dead, ignore it
FD_UPDT_CLOSED, // FD was closed
FD_UPDT_MIGRATED, // FD was migrated, ignore it now
Some pollers already used to call it last and have nothing to do after
it, regardless of the result. epoll has to delete the FD in case a
migration is detected. Overall this removes more code than it adds.
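A hedged sketch of what a poller's inner loop now looks like (epoll
shown; the exact fd_update_events() prototype and the epoll_fd variable
are assumptions):

    switch (fd_update_events(fd, evts)) {
    case FD_UPDT_MIGRATED:
        /* the FD moved to another thread: drop it from this thread's epoll set */
        epoll_ctl(epoll_fd, EPOLL_CTL_DEL, fd, NULL);
        break;
    case FD_UPDT_DONE:
    case FD_UPDT_DEAD:
    case FD_UPDT_CLOSED:
    default:
        break;      /* nothing else to do for this FD */
    }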
Since 2.4 with commit f50906519 ("MEDIUM: fd: merge fdtab[].ev and state
for FD_EV_* and FD_POLL_* into state") we can merge all flag updates at
once in fd_update_events(). Previously this was performed in 1 to 3 steps,
setting the polling state, then setting READY_R if in/err/hup, and setting
READY_W if out/err. But since the commit above, all flags are stored
together in the same structure field that is being updated with the new
flags, thus we can simply update the flags altogether and avoid multiple
atomic operations. This even removes the need for atomic ops for FDs that
are not shared.
There's a theoretical race (that we failed to trigger) in function
fd_update_events(), which could strike on idle connections. The "locked"
variable will most often be 0 as the FD is bound to the current thread
only. Another thread could take it over once "locked" is set, change
the thread and running masks. Then the first thread updates the FD's
state non-atomically and possibly overwrites what the other thread was
preparing. It still looks like the FD's state will ultimately converge
though.
The solution against this is to set the running flag earlier so that a
takeover() attempt cannot succeed, or that the fd_set_running() attempt
fails, indicating that nothing needs to be done on this FD.
While this is sufficient for a simple fix to be backported, it leaves
the FD actively polled in the calling thread; this will trigger a second
wakeup which will notice the absence of tid_bit in the thread_mask,
getting rid of it.
A more elaborate solution would consist in calling fd_set_running()
directly from the pollers before calling fd_update_events(), getting
rid of the thread_mask test and letting the caller eliminate that FD
from its list if needed.
Interestingly, this code also proves to be suboptimal in that it sets
the FD state twice instead of calculating the new state at once and
always using a CAS to set it. This is a leftover of a simplification
that went into 2.4 and which should be explored in a future patch.
This may be backported as far as 2.2.
The takeover of idle conns between threads is particularly tricky, for
two reasons:
- there's no way to atomically synchronize kernel-side polling with
userspace activity, so late events will always be reported for some
FDs just migrated ;
- upon error, an FD may be immediately reassigned to whatever other
thread since it's process-wide.
The current model uses the FD's thread_mask to figure if an FD still
ought to be reported or not, and a per-thread idle connection queue
from which eligible connections are atomically added/picked. I/Os
coming from the bottom for such a connection must remove it from the
list so that it's not elected. Same for timeout tasks and iocbs. And
these last ones check their context under the idle_conn lock to judge
if they're still allowed to run.
One rare case was omitted: the wake() callback. This one is rare, it
may serve to notify about finalized connect() calls that are not being
polled, as well as unhandled shutdowns and errors. This callback was
not protected till now because it wasn't seen as sensitive, but there
exists a particular case where it may be called without protection in
parallel to a takeover. This happens in the following sequence:
- thread T1 wants to establish an outgoing connection
- the connect() call returns EINPROGRESS
- the poller adds it using epoll_ctl()
- epoll_wait() reports it, connect() is done. The connection is
not being marked as actively polled anymore but is still known
from the poller.
- the request is sent over that connection using send(), which
queues to system buffers while data are being delivered
- the scheduler switches to other tasks
- the request is physically sent
- the server responds
- the stream is notified that send() succeeded, and makes progress,
trying to recv() from that connection
- the recv() succeeds, the response is delivered
- the poller doesn't need to be touched (still no active polling)
- the scheduler switches to other tasks
- the server closes the connection
- the poller on T1 is notified of the SHUTR and starts to call mux->wake()
- another thread T2 takes over the connection
- T1 continues to run inside wake() and releases the connection
- T2 is just dereferencing it.
- BAM.
The most logical solution here is to surround the call to wake() with an
atomic removal/insert of the connection from/into the idle conns lists.
This way, wake() is guaranteed to run alone. Any other poller reporting
the FD will not have its tid_bit in the thread_mask so will not bother
it. Another thread trying a takeover will not find this connection. A
task or tasklet being woken up late will either be on the same thread,
or be called on another one with a NULL context since it will be the
consequence of previous successful takeover, and will be ignored. Note
that the extra cost of a lock and tree access here have a low overhead
which is totally amortized given that these ones roughly happen 1-2
times per connection at best.
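A hedged sketch of the protection around the wake() call (the lock,
tree and helper names below are assumptions; the idea is what matters):

    HA_SPIN_LOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);
    conn_delete_from_tree(&conn->hash_node->node);  /* invisible to takeover/pick-up */
    HA_SPIN_UNLOCK(IDLE_CONNS_LOCK, &idle_conns[tid].idle_conns_lock);

    ret = conn->mux->wake(conn);                    /* guaranteed to run alone */

    if (ret >= 0) {
        /* the connection survived: atomically put it back into the idle list */
    }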
While it was possible to crash the process after 10-100k req using H2
and a hand-refined configuration achieving perfect synchronism between
a long (20+) chain of proxies and a short timeout (1ms), now with that
fix this never happens even after 10M requests.
Many thanks to Olivier for proposing this solution and explaining why
it works.
This should be backported as far as 2.2 (when inter-thread takeover was
introduced). The code in older versions will be found in conn_fd_handler().
A workaround consists in disabling inter-thread pool sharing using:
tune.idle-pool.shared off
In case of connection failure, a dedicated error message is output,
following the format described in section "Error log format" of the
documentation. These messages cannot be configured through a log-format
option.
This patch adds a new option, "dontloglegacyconnerr", that disables
those error logs when set, and "replaces" them by a regular log line
that follows the configured log-format (thanks to a call to sess_log in
session_kill_embryonic).
The new fc_conn_err sample fetch allows to add the legacy error log
information into a regular log format.
This new option is unset by default so the logging logic will remain the
same until this new option is used.
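A hedged configuration sketch of how this is meant to be used (the
log-format string itself is just an example):

    frontend fe
        option dontloglegacyconnerr
        log-format "%ci:%cp [%tr] %ft %[fc_conn_err]/%[fc_conn_err_str]"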
This new sample fetch along the ssl_fc_hsk_err_str fetch contain the
last SSL error of the error stack that occurred during the SSL
handshake (from the frontend's perspective). The errors happening during
the client's certificate verification will still be given by the
ssl_c_err and ssl_c_ca_err fetches. This new fetch will only hold errors
retrieved by the OpenSSL ERR_get_error function.
The fc_conn_err and fc_conn_err_str sample fetches give information
about the problem that made the connection fail. This information would
previously only have been given by the error log messages meaning that
thanks to these fetches, the error log can now be included in a custom
log format. The log strings were all found in the conn_err_code_str
function.
The CO_ER_SSL_EARLY_FAILED and CO_ER_CIP_TIMEOUT connection error codes
were missing in the conn_err_code_str switch which converts the error
codes into string.
This patch can be backported on all stable branches.
This patch renames the proxy capability "LUA" to "INT" so it could be
used for any internal proxy.
Every proxy that is not user-defined should use this flag.
Oss-fuzz reports in issue 36328 that we can recurse too far by passing
extremely deep expressions to the ".if" parser. I thought we were still
limited to the 1024 chars per line, that would be highly sufficient, but
we don't have any limit now :-/
Let's just pass a maximum recursion counter to the recursive parsers.
It's decremented for each call and the expression fails if it reaches
zero. On the most complex paths it can add 3 levels per parenthesis,
so with a limit of 1024, that's roughly 343 nested sub-expressions that
are supported in the worst case. That's more than sufficient, for just
a few kB of RAM.
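A minimal sketch of the mechanism (real prototypes differ):

    int cfg_parse_cond_expr(const char **text, struct cfg_cond_expr **expr,
                            char **err, int maxdepth)
    {
        if (maxdepth <= 0) {
            memprintf(err, "unparsable conditional expression: too many nested expressions");
            return -1;
        }
        /* every nested term or sub-expression is then parsed with maxdepth - 1 */
        return 1;
    }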
No backport is needed.
This option had always been broken in HTX, which means that the first
breakage appeared in 1.9, that it was broken by default in 2.0 and that
no workaround existed starting with 2.1. The way this option works is
particularly unfit for the rest of the configuration and for the internal
architecture. It had some uses when it was introduced 14 years ago but
nowadays it's possible to do much better and more reliable using a
set of "http-request set-dst" and "http-request set-uri" rules, which
additionally are compatible with DNS resolution (via do-resolve) and
are not exclusive to normal load balancing. The "option-http_proxy"
example config file was updated to reflect this.
The option is still parsed so that an error message gives hints about
what to look for.
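For reference, a hedged sketch of the kind of replacement mentioned
above (the "mydns" resolvers section is an assumption):

    http-request do-resolve(txn.dstip,mydns) hdr(host),lower
    http-request set-dst var(txn.dstip)
    http-request set-uri %[path]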
The cfg_free_cond_{term,and,expr}() functions used to take a pointer to
the pointer to be freed in order to replace it with a NULL once done.
But this doesn't cope well with freeing lists as it would require
recursion which the current code tried to avoid.
Let's just change the API to free the area and let the caller set the NULL.
This leak was reported by oss-fuzz (issue 36265).
Now it's possible to form a term using parenthesis around an expression.
This will soon allow to build more complex expressions. For now they're
still pretty limited but parenthesis do work.
Now evaluating a condition will rely on an expression (or an empty string),
and this expression will support ORing a sub-expression with another
optional expression. A sub-expression ANDs a term with another optional
sub-expression. With this alone precedence between && and || is respected,
and the following expression:
A && B && C || D || E && F || G
will naturally evaluate as:
(A && B && C) || D || (E && F) || G
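Put differently, the grammar informally reads as follows (parenthesized
terms only arrive in a later patch):

    expr    = subexpr [ "||" expr ]      (OR binds the loosest)
    subexpr = term    [ "&&" subexpr ]   (AND binds tighter than OR)
    term    = a single predicate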
It's not convenient to let the caller be responsible for node allocation,
better have the leaf function do that and implement the accompanying free
call. Now only a pointer is needed instead of a struct, and the leaf
function makes sure to leave the situation in a consistent way.
The purpose is to build a recursive-descent parser that will split conditions
into expressions made of terms. There are two phases, a parsing phase
and an evaluation phase. Strictly speaking it's not required to cut
that in two right now, but it's likely that in the future we won't want
certain predicates to be evaluated during the parsing (e.g. file system
checks or execution of some external commands).
The cfg_eval_condition() function is now much simpler, it just tries to
parse a single term, and if OK evaluates it, then returns the result.
Errors are unchanged and may still be reported during parsing or
evaluation.
It's worth noting that some invalid expressions such as streq(a,b)zzz
continue to parse correctly for now (what remains after the parenthesis
is simply ignored as not necessary).
The .if/.else/.endif and condition evaluation code is quite dirty and
was dumped into cfgparse.c because it was easy. But it should be tidied
quite a bit as it will need to evolve.
Let's move all that to cfgcond.{c,h}.
make_arg_list() can create an array of arguments, some of which remain
to be resolved, but all users had to deal with their own roll back on
error. Let's add a free_args() function to release all the array's
elements and let the caller deal with the array itself (sometimes it's
allocated in the stack).
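A hedged sketch of a caller's error path (the exact make_arg_list()
signature is approximated here):

    struct arg *args = NULL;

    if (make_arg_list(in, len, mask, &args, err, NULL, NULL, NULL) < 0) {
        free_args(args);  /* releases the contents of every element */
        free(args);       /* the array itself stays under the caller's control
                           * (nothing to free if it lives on the stack) */
        return 0;
    }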
Since 1.9 with commit 673867c35 ("MAJOR: applets: Use tasks, instead
of rolling our own scheduler.") the thread_mask field of the appctx
became unused, but the code hadn't been cleaned for this. The appctx
has its own task and the task's thread_mask is the one to be displayed.
It's worth noting that all calls to appctx_new() pass tid_bit as the
thread_mask. This makes sense, and it could be convenient to decide
that this becomes the norm and to simplify the API.
Define a new global config statement named
"h2-workaround-bogus-websocket-clients".
This statement will disable the automatic announce of h2 websocket
support as specified in RFC8441. This can be used to overcome clients
which fail to implement the relatively recent RFC8441. Clients will in
this case automatically downgrade to http/1.1 for the websocket tunnel
if the haproxy configuration allows it.
This feature is relatively simple and can be backported up to 2.4, which
saw the introduction of h2 websocket support.
Replace http_get_path by the http_uri_parser API. The new function is
renamed http_parse_path. Replace duplicated code for scheme and
authority parsing by invocations to http_parse_scheme/authority.
If no scheme is found for a URI detected as an absolute-uri/authority,
consider it to be in authority format: no path will be found. For an
absolute-uri or absolute-path, use the remainder of the string as the
path. A new http_uri_parser state is declared to mark the path parsing
as done.
Replace http_get_authority by the http_uri_parser API.
The new function is renamed http_parse_authority. Replace duplicated
scheme parsing code by http_parse_scheme invocation. A new
http_uri_parser state is declared to mark the authority parsing as done.
Replace http_get_scheme by the http_uri_parser API. The new function is
renamed http_parse_scheme. A new http_uri_parser state is declared to
mark the scheme parsing as completed.
Implement an http_uri_parser type. This type will be used as a context to
parse the various elements of a URI.
The goal of this series of patches is to factorize duplicated code
between the http_get_scheme/authority/path functions. A simple parsing
API is designed to be able to extract once each element of an HTTP URI
in order. The functions will be renamed in the following patches to
reflect the API change with the prefix http_parse_*.
For the parser API, the http_uri_parser type must first be
initialized before usage. It will register the URI to parse and detect
its format according to RFC 7230.
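A hedged sketch of the intended call sequence (exact prototypes
assumed):

    struct http_uri_parser parser = http_uri_parser_init(uri);
    struct ist scheme    = http_parse_scheme(&parser);
    struct ist authority = http_parse_authority(&parser, 0);
    struct ist path      = http_parse_path(&parser);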
Implement the scheme-based URI normalization as described in RFC 3986
6.3.2. Its purpose is to remove the port of a URI if the default one for
its scheme is used: 80 for http and 443 for https. All other ports are
left untouched. For example, "http://example.com:80/" is rewritten to
"http://example.com/", while "http://example.com:8080/" is left as-is.
This method uses an htx message as an input. It requires that the target
URI is in absolute-form with a http/https scheme. This represents most
of h2 requests except CONNECT. On the contrary, most of h1 requests
won't be eligible as origin-form is the standard case.
The normalization is first applied on the target URL of the start line.
Then, it is conducted on every Host headers present, assuming that they
are equivalent to the target URL.
This change will be notably useful to not confuse users who are
accustomed to using the host for routing without specifying default ports.
This problem was recently encountered with Firefox which specifies the 443
default port for http2 websocket Extended CONNECT.
This patch adds the definition of two new array data_types:
'gpc': This is an array of 32-bit General Purpose Counters.
'gpc_rate': This is an array of increment rates of General Purpose Counters.
Like for all arrays, they are limited to 100 elements.
This patch also adds actions and fetches to handle
elements of those arrays.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpc0', 'gpc1', 'gpc0_rate' nor 'gpc1_rate'.
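A hedged configuration sketch (the exact argument syntax, array index
first then tracker id, is an assumption):

    stick-table type ip size 1m expire 1h store gpc(3),gpc_rate(3,10s)
    tcp-request connection track-sc0 src
    tcp-request connection sc-inc-gpc(2,0)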
This patch adds the definition of a new array data_type
'gpt'. This is an array of 32-bit General Purpose Tags.
Like for all arrays, it is limited to 100 elements.
This patch also adds actions and fetches to handle
elements of this array.
Note: As documented, those new actions and fetches won't
apply to the legacy 'gpt0' data type.
This patch adds support of array data_types on the peer protocol.
The table definition message will provide an additional parameter
for array data-types: the number of elements of the array.
In case of array of frqp it also provides a second parameter:
the period used to compute freq counter.
The array elements are std_type values linearly encoded in
the update message.
Note: if a remote peer announces an array data_type without
parameters into the table definition message, all updates
on this table will be ignored because we can not
parse update messages consistently.
This patch provides the code to handle arrays of some
standard types (SINT, UINT, ULL and FRQP) in stick table.
This way we could define new "array" data types.
Note: the number of elements of an array was limited
to 100 to ensure that an encoded
update message will continue to fit into a buffer
when the peer protocol will handle such data types.
This patch replaces all advanced data type aliases on
stktable_data_cast calls by standard types.
This way we could call the same stktable_data_cast
regardless of the used advanced data type as long as they
are using the same std type.
It also removes all the advanced data type aliases.
It is now possible to set the Netfilter MARK and the TOS field value in all
packets sent to the client from any tcp-request rulesets or the "tcp-response
content" one. To do so, the parsing of "set-mark" and "set-tos" actions are
moved in tcp_act.c and the actions evaluation is handled in dedicated functions.
This patch may be backported as far as 2.2 if necessary.
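A hedged configuration example of the newly allowed placements:

    tcp-request content  set-mark 0x1e if { src 10.0.0.0/8 }
    tcp-response content set-tos 16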
It is now possible to set the "nice" factor of the current stream from a
"tcp-request content" or "tcp-response content" ruleset. To do so, the
action parsing is moved in stream.c and the action evaluation is handled in
a dedicated function.
This patch may be backported as far as 2.2 if necessary.
It is now possible to set the stream log level from a "tcp-request content"
or "tcp-response content" ruleset. To do so, the action parsing is moved in
stream.c and the action evaluation is handled in a dedicated function.
This patch should fix issue #1306. It may be backported as far as 2.2 if
necessary.
Now we directly use p->queue to get to the queue, which is much more
straightforward. The performance on 100 servers and 16 threads
increased from 560k to 574k RPS, or 2.5%.
A lot more simplifications are possible, but the minimum was done at
this point.
A queue is specific to a server or a proxy, so we don't need to place
this distinction inside all pendconns, it can be in the queue itself.
This commit adds the relevant fields "px" and "sv" into the struct
queue, and initializes them accordingly.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 362k
to 374k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 9%.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with or without locking, so this should
remain invisible.
This reverts commit fcb8bf8650.
The recent changes since 5304669e1 ("MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn") opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This reverts commit c83e45e9b0.
The recent changes since 5304669e1 ("MEDIUM: queue: make
pendconn_process_next_strm() only return the pendconn") opened a tiny race
condition between stream_free() and process_srv_queue(), as the pendconn
is accessed outside of the lock, possibly while it's being freed. A
different approach is required.
This basically undoes the API changes that were performed by commit
0274286dd ("BUG/MAJOR: server: fix deadlock when changing maxconn via
agent-check") to address the deadlock issue: since process_srv_queue()
doesn't use the server lock anymore, it doesn't need the "server_locked"
argument, so let's get rid of it before it gets used again.
Till now whenever a server or proxy's queue was touched, this server
or proxy's lock was taken. Not only this requires distinct code paths,
but it also causes unnecessary contention with other uses of these locks.
This patch adds a lock inside the "queue" structure that will be used
the same way by the server and the proxy queuing code. The server used
to use a spinlock and the proxy an rwlock, though the queue only used
it for locked writes. This new version uses a spinlock since we don't
need the read lock part here. Tests have not shown any benefit nor cost
in using this one versus the rwlock so we could change later if needed.
The lower contention on the locks increases the performance from 491k
to 507k req/s on 16 threads with 20 servers and leastconn. The gain
with roundrobin even increases by 6%.
The performance profile changes from this:
13.03% haproxy [.] fwlc_srv_reposition
8.08% haproxy [.] fwlc_get_next_server
3.62% haproxy [.] process_srv_queue
1.78% haproxy [.] pendconn_dequeue
1.74% haproxy [.] pendconn_add
to this:
11.95% haproxy [.] fwlc_srv_reposition
7.57% haproxy [.] fwlc_get_next_server
3.51% haproxy [.] process_srv_queue
1.74% haproxy [.] pendconn_dequeue
1.70% haproxy [.] pendconn_add
At this point the differences are mostly measurement noise.
This is tagged medium because the lock is changed, but no other part of
the code touches the queues, with or without locking, so this should
remain invisible.
This structure will be common to proxies and servers and will contain
everything needed to handle their respective queues. For now it's only
a tree head, a length and an index.
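A minimal sketch of the structure as described (field names taken from
this description, comments are assumptions):

    struct queue {
        struct eb_root head;    /* tree of pending connections */
        unsigned int   length;  /* number of entries in the queue */
        unsigned int   idx;     /* current queueing index */
    };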
This essentially reverts commit 2b4370078 ("MINOR: lb/api: let callers
of take_conn/drop_conn tell if they have the lock") that was merged
during 2.4 before the various locks could be eliminated at the lower
layers. Passing that information complicates the cleanup of the queuing
code and it's become useless.
The server_parse_maxconn_change_request function takes the server lock. However,
it can be called via agent-checks or lua code which already
hold it. This bug has been introduced by the following commit:
commit 79a88ba3d0
BUG/MAJOR: server: prevent deadlock when using 'set maxconn server'
This commit tried to fix another deadlock which can occur because
previously server_parse_maxconn_change_request required the server lock
to be held. However, it may internally call process_srv_queue which also
locks the server lock. The locking policy has thus been updated. The fix
is functional for the CLI 'set maxconn' but fails to address the
agent-check / lua counterparts.
This new issue is fixed in two steps:
- changes from the above commit have been reverted. This means that
server_parse_maxconn_change_request must again be called with the
server lock held.
- to counter the deadlock fixed by the above commit, process_srv_queue
now takes an argument to render the server locking optional if the
caller already held it. This is only used by
server_parse_maxconn_change_request.
The above commit was subject to backport up to 1.8. Thus this commit
must be backported in every release where it is already present.
GitHub uses github/linguist to determine the programming language used for each
source file to show statistics and to power the search. In cases of unique file
extensions this is easy, but for `.h` files the situation is less clear as they
are used for C, C++, Objective C and more. In these cases linguist makes use of
heuristics to determine the language.
One of these heuristics for C++ is that the file contains a line beginning with
`try`, only preceded by whitespace indentation. This heuristic matches the long
comment at the bottom of `channel-t.h`, as one sentence includes the word `try`
after a linebreak.
Fix this misdetection by changing the comment to follow the convention that all
lines start with an asterisk.
The function ssl_sock_load_srv_cert will be used at runtime for dynamic
servers. If the cert is not loaded in the ckch tree, we try to access it
from the file-system.
Now this access operation is rendered optional by a new function
argument. It is only allowed at parsing time, but will be disabled for
dynamic servers at runtime.
Explicitly call ssl_initialize_random to initialize the random generator
in init() global function. If the initialization fails, the startup is
interrupted.
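A minimal sketch of the check added to init() (the exact message
wording is an assumption):

    if (!ssl_initialize_random()) {
        ha_alert("OpenSSL random data generator initialization failed.\n");
        exit(1);
    }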
This commit is in preparation for support of ssl on dynamic servers. To
be able to activate ssl on dynamic servers, it is necessary to ensure
that the random generator is initialized on startup regardless of the
config. It cannot be called at runtime as access to /dev/urandom is
required.
This also has the effect of fixing the previous inconsistent behavior.
Indeed, if bind or server in the config are using ssl, the
initialization function was called, and if it failed, the startup was
interrupted. Otherwise, the ssl initialization code could have been
called through the ssl server for lua, but this time without blocking
the startup on error. Or not called at all if lua was deactivated.
The SF_SRV_REUSED flag was set if a stream reused a backend connection.
One of its purposes is to count the total reuse on the backend in
opposition to newly instantiated connection.
However, the flag was diverted from its original purpose since the
following commit:
e8f5f5d8b2
BUG/MEDIUM: servers: Only set SF_SRV_REUSED if the connection if fully ready.
With this change, the flag is not set anymore if the mux is not ready
when a connection is picked for reuse. This can happen for multiplexed
connections which are inserted in the available list as soon as created
in http-reuse always mode. The goal of this change is to not retry
immediately this request in case of an error on the same server if the
reused connection is not fully ready.
This change is justified for the retry timeout handling but it breaks
other places which still use the flag for its original purpose. Mainly,
in this case the wrong 'connect' backend counter is incremented instead
of the 'reuse' one. The flag is also used in http_return_srv_error and
may have an impact if an HTTP server error is returned for this stream.
To fix this problem, the original purpose of the flag is restored by
setting it unconditionally when a connection is reused. Additionally, a
new flag SF_SRV_REUSED_ANTICIPATED is created. This flag is set when the
connection is reused but the mux is not ready yet. For the timeout
handling on error, the request is retried immediately only if the stream
reused a connection without this new 'anticipated' flag.
This must be backported up to 2.1.
When a server relies on a SRV resolution, a task is created to clean it up
(fqdn/port and address) when the SRV resolution is considered as outdated
(based on the resolvers 'timeout' value). It is only possible if the server
inherits outdated info from a state file and is no longer selected to be
attached to a SRV item. Note that most of the time, a server is attached to a
SRV item. Thus when the item becomes obsolete, the server is cleaned
up.
It is important to have such task to be sure the server will be free again
to have a chance to be resolved again with fresh information. Of course,
this patch is a workaround to solve a design issue. But there is no other
obvious way to fix it without rewriting all the resolvers part. And it must
be backportable.
This patch relies on following commits:
* MINOR: resolvers: Clean server in a dedicated function when removing a SRV item
* MINOR: resolvers: Remove server from named_servers tree when removing a SRV item
All the series must be backported as far as 2.2 after some observation
period. Backports to 2.0 and 1.8 must be evaluated.
To avoid repeating the same source code, the allocation and initialization
of the per_thr field of the server structure is moved to a dedicated
function.
This function appends to a buffer some information from a connection.
This will be used by traces and possibly some debugging as well. A
frontend/backend/server, transport/control layers, source/destination
ip:port, connection pointer and direction are reported depending on
the available information.
With a single process, we don't need to USE_PRIVATE_CACHE, USE_FUTEX
nor USE_PTHREAD_PSHARED anymore. Let's only keep the basic spinlock
to lock between threads.
The relative_pid is always 1. In mworker mode we also have a
child->relative_pid which is always equal to relative_pid, except for a
master (0) or external process (-1), but these types are usually tested
for, except for one place that was amended to carefully check for the
PROC_O_TYPE_WORKER option.
Changes were pretty limited as most usages of relative_pid were for
designating a process in stats output and peers protocol.
Lots of places iterating over nbproc or comparing with nbproc could be
simplified. Further, "bind-process" and "process" parsing that was
already limited to process 1 or "all" or "odd" resulted in a bind_proc
field that was either 0 or 1 during the init phase and later always 1.
All the checks for compatibilities were removed since it's not possible
anymore to run a frontend and a backend on different processes or to
have peers and stick-tables bound on different ones. This is the largest
part of this patch.
The bind_proc field was removed from both the proxy and the receiver
structs.
Since the "process" and "bind-process" directives are still parsed,
configs making use of correct values allowing process 1 will continue
to work.
This is a leftover of a previous attempt that was introduced in 2.4 by
commit d3a88c1c3 ("MEDIUM: connection: close front idling connection on
soft-stop"). It can be backported, as the variable doesn't exist.
Since threads were introduced in 1.8, the USE_PRIVATE_CACHE mode of the
shctx was not updated to use locks. Originally it was meant to disable
sharing between processes, so it removes the lock/unlock instructions.
But with threads enabled, it's not possible to work like this anymore.
It's easy to see that once built with private cache and threads enabled,
sending violent SSL traffic to the process instantly makes it die.
The HTTP cache is very likely affected as well.
This patch addresses this by falling back to our native spinlocks when
USE_PRIVATE_CACHE is used. In practice we could use them also for other
modes and remove all older implementations, but this patch aims at keeping
the changes very low and easy to backport. A new SHCTX_LOCK label was
added to help with debugging, but OTHER_LOCK might be usable as well
for backports.
An even lighter approach for backports may consist in always declaring
the lock (or reusing "waiters"), and calling pl_take_s() for the lock()
and pl_drop_s() for the unlock() operation. This could even be used in
all modes (process and threads), even when thread support is disabled.
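In other words, something along these lines for the backport-only
variant (reusing the existing "waiters" field; names assumed):

    #define shctx_lock(shctx)    pl_take_s(&(shctx)->waiters)
    #define shctx_unlock(shctx)  pl_drop_s(&(shctx)->waiters)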
Subsequent patches will further clean up this area.
This patch must be backported to all supported versions since 1.8.
This one was deprecated in 2.3 and marked for removal in 2.5. It suffers
too many limitations compared to threads, and prevents some improvements
from being engaged. Instead of a bypassable startup error, there is now
a hard error.
The parsing code was removed, and very few obvious cases were as well.
The code is deeply rooted at certain places (e.g. "for" loops iterating
from 0 to nbproc) so it will not be that trivial to remove everywhere.
The "bind" and "bind-process" parsers will have to be adjusted, though
maybe not completely changed if we later want to support thread groups
for large NUMA machines. Some stats socket restrictions were removed,
and the doc was updated according to what was done. A few places in the
doc still refer to nbproc and will have to be revisited. The master-worker
code also refers to the process number to distinguish between master and
workers and will have to be carefully adjusted. The MAX_PROCS macro was
reset to 1, this will at least reduce the size of some remaining arrays.
Two regtests were depending on this directive, one with an explicit
"nbproc 1" and another one testing the master's CLI using nbproc 4.
Both were adapted.
Commit ab0a5192a ("MEDIUM: config: mark "grace" as deprecated") marked
the "grace" keyword as deprecated in 2.3, tentative removal for 2.4
with a hard deadline in 2.5, so let's remove it and return an error now.
This old and outdated feature was incompatible with soft-stop, reload
and socket transfers, and keeping it forced ugly hacks in the lower
layers of the protocol stack.
This patch adds a ref into servers to register them onto the
record answer item used to set their hostnames.
It also adds a head list into 'srvrq' to register servers that are free
to be assigned to a SRV record.
A tree head is also added to srvrq to hold servers which present a
hostname in the server state file, so that they can quickly be re-linked
to the matching record as soon as an item presents the same name.
This results in better performance on SRV record response
parsing.
This is an optimization but it could avoid triggering haproxy's
internal watchdog in some circumstances. And for this reason
it should be backported as far as we can (2.0 ?)
This patch adds a head list into answer items, listing the servers which
use this record to set their IPs. It makes lookups on duplicated IPs
faster and allows checking immediately if an item is still valid when
renewing the IP.
This results in better performance on A/AAAA resolutions.
This is an optimization but it could avoid triggering haproxy's
internal watchdog in some circumstances. And for this reason
it should be backported as far as we can (2.0 ?)
When an HTX block is expanded, a defragmentation may be performed first to
have enough space to copy the new data. When it happens, the meta data of
the HTX message must take account of the new data length but copied data are
still unchanged at this stage (because we need more space to update the
message content). And here there is a bug because the meta data are updated
by the caller. It means that when the blocks content is copied, the new
length is already set. Thus a block larger than the reality is copied and
data outside the buffer may be accessed, leading to a crash.
To fix this bug, htx_defrag() is updated to use an extra argument with the
new meta data to use for the referenced block. Thus the caller does not need
to update the HTX message by itself. However, it still has to update the
data.
Most of the time, the bug will be encountered in the HTTP compression
filter. But, even if it is highly unlikely, in theory it is also possible to
hit it when an HTTP header (or only its value) is replaced or when the
start-line is changed.
This patch must be backported as far as 2.0.
A tiny change in commit 6af81f80f ("MEDIUM: errors: implement parsing
context type") triggered an awful bug in gcc 5 and below (4.7.4 to 5.5
confirmed affected, at least on aarch64/mips/x86_64) causing the startup
to loop forever in acl_find_target().
This was tracked down to the acl.c file seeing a different definition
of the struct proxy than other files. The reason for this is that it
sees an unpacked "enum obj_type" (4 bytes) while others see it packed
(1 byte), thus all fields in the struct are having a different
alignment, and the "acl" list is shifted one pointer to the next struct
and seems to loop onto itself.
The commit above did nothing more than adding "enum obj_type *obj" in a
new struct without including obj_type.h, and that was apparently enough
for the compiler to internally declare obj_type as a regular enum and
silently ignore the packed attribute that it discovers later, so depending
on the order of includes, some files would see it as 1 byte and others as
4.
This patch simply adds the missing include, but due to the nature of the
bug, creating a special "packed_enum" definition to disable the packed
attribute on such compilers could probably be a safer option.
No backport is needed as this is only in -dev.
The ifdefs surrounding the "show ssl ocsp-response" functionality that
were supposed to disable the code with BoringSSL were built the wrong
way.
It does not need to be backported.
Now that the modified lockless variant does not need a DWCAS anymore,
there's no reason to keep the much slower locked version, so let's
just get rid of it.
In GH issue #1275, Fabiano Nunes Parente provided a nicely detailed
report showing reproducible crashes under musl. Musl is one of the libs
coming with a simple allocator for which we prefer to keep the shared
cache. On x86 we have a DWCAS so the lockless implementation is enabled
for such libraries.
And this implementation has had a small race since day one: the allocator
will need to read the first object's <next> pointer to place it into the
free list's head. If another thread picks the same element and immediately
releases it, while both the local and the shared pools are too crowded, it
will be freed to the OS. If the libc's allocator immediately releases it,
the memory area is unmapped and we can have a crash while trying to read
that pointer. However there is no problem as long as the item remains
mapped in memory because whatever value found there will not be placed
into the head since the counter will have changed.
The probability for this to happen is extremely low, but as analyzed by
Fabiano, it increases with the buffer size. On 16 threads it's relatively
easy to reproduce with 2MB buffers above 200k req/s, where it should
happen within the first 20 seconds of traffic usually.
This is a structural issue for which there are two non-trivial solutions:
- place a read lock in the alloc call and a barrier made of lock/unlock
in the free() call to force to serialize operations; this will have
a big performance impact since free() is already one of the contention
points;
- change the allocator to use a self-locked head, similar to what is
done in the MT_LISTS. This requires two memory writes to the head
instead of a single one, thus the overhead is exactly one memory
write during alloc and one during free;
This patch implements the second option. A new POOL_DUMMY pointer was
defined for the locked pointer value, allowing to both read and lock it
with a single xchg call. The code was carefully optimized so that the
locked period remains the shortest possible and that bus writes are
avoided as much as possible whenever the lock is held.
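A hedged sketch of the allocation side with the self-locked head
(POOL_DUMMY comes from this description; the other names and the
in-object next pointer layout are assumptions):

    void *item;

    /* grab the whole head and lock it in a single xchg */
    while ((item = HA_ATOMIC_XCHG(&pool->free_list, POOL_DUMMY)) == POOL_DUMMY)
        __ha_cpu_relax();                     /* another thread holds the head */

    /* the first element cannot be freed while we hold the head, so reading
     * its <next> pointer is now safe; publishing it also unlocks the head.
     */
    HA_ATOMIC_STORE(&pool->free_list, item ? *(void **)item : NULL);
    return item;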
Tests show that while a bit slower than the original lockless
implementation on large buffers (2MB), it's 2.6 times faster than both
the no-cache and the locked implementation on such large buffers, and
remains as fast as or faster than all implementations when buffers are
48k or higher. Tests were also run on arm64 with similar results.
Note that this code is not used on modern libcs featuring a fast allocator.
A nice benefit of this change is that since it removes a dependency on
the DWCAS, it will be possible to remove the locked implementation and
replace it with this one, that is then usable on all systems, thus
significantly increasing their performance with large buffers.
Given that lockless pools were introduced in 1.9 (not supported anymore),
this patch will have to be backported as far as 2.0. The code changed
several times in this area and is subject to many ifdefs which will
complicate the backport. What is important is to remove all the DWCAS
code from the shared cache alloc/free lockless code and replace it with
this one. The pool_flush() code is basically the same code as the
allocator, retrieving the whole list at once. If in doubt regarding what
barriers to use in older versions, it's safe to use the generic ones.
This patch depends on the following previous commits:
- MINOR: pools: do not maintain the lock during pool_flush()
- MINOR: pools: call malloc_trim() under thread isolation
- MEDIUM: pools: use a single pool_gc() function for locked and lockless
The last one also removes one occurrence of an unneeded DWCAS in the
code that was incompatible with this fix. The removal of the now unused
seq field will happen in a future patch.
Many thanks to Fabiano for his detailed report, and to Olivier for
his help on this issue.
Since the code was reorganized, DEBUG_UAF was still tested in the locked
pool code despite pools being disabled when DEBUG_UAF is used. Let's move
the test to pool_put_to_os() which is the one that is always called in
this condition.
The impact is only a possible misleading analysis during a troubleshooting
session due to a missing double-free or free-of-const-area test that is
normally already dealt with by the underlying code anyway. In practice it's
unlikely anyone will ever notice.
This should only be backported to 2.4.
This patch adds the "show ssl ocsp-response [<id>]" CLI command. This
command can be used to display the IDs of the OCSP tree entries along
with details about the entries' certificate ID (issuer's name and key
hash + serial number), or to display the details of a single
ocsp-response if an ID is given. The details displayed in this latter
case are the ones shown by a "openssl ocsp -respin <ocsp-response>
-text" call.
Since commit 04a5a44 ("BUILD: ssl: use HAVE_OPENSSL_KEYLOG instead of
OpenSSL versions") the "tune.ssl.keylog" feature is broken because
HAVE_OPENSSL_KEYLOG does not exist.
Replace this by a HAVE_SSL_KEYLOG which is defined in openssl-compat.h.
Also add an error when not built with the right openssl version.
Must be backported as far as 2.3.
Change the algorithm for the generation of the user messages context
prefix. Remove the dubious API relying on optional printf positional
arguments. This may be non-portable, and in fact the CI glibc crashes
with the following error when some arguments are not present in the
format string:
"invalid %N$ use detected".
Now, a fixed buffer attached to the context instance is allocated once
for the program lifetime. Then call repeatedly snprintf with the
optional arguments of context if present to build the context string.
The buffer is deallocated via a per-thread free handler.
This does not need to be backported.
This patch adds the `-cc` (check condition) argument to evaluate conditions on
startup and return the result as the exit code.
As an example this can be used to easily check HAProxy's version in scripts:
haproxy -cc 'version_atleast(2.4)'
This resolves GitHub issue #1246.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
Create a parsing_ctx structure. This type is used to store information
about the current file/line parsed. A global context is created and
can be manipulated when haproxy is in STARTING mode. When starting is
over, the context is reset and should not be accessed anymore.
The user messages buffer is used to store the stderr output after the
starting is over. Each thread has its own user messages buffer. Add some
functions to add a new message, retrieve and clear the content.
The user messages buffer's primary goal is to be consulted by CLI
handlers. Each handler using it must clear the buffer before starting
its operation.
Move functions related to errors output on stderr from log.c to a newly
created errors.c file. It targets print_message and
ha_alert/warning/notice/diag functions and related startup_logs feature.
A memory allocation failure happening in chash_init_server_tree while
trying to allocate a server's lb_nodes item used in consistent hashing
would have resulted in a crash. This function is only called during
configuration parsing.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
A memory allocation failure happening in mworker_env_to_proc_list when
trying to allocate a mworker_proc would have resulted in a crash. This
function is only called during init.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
A memory allocation failure happening during peers_register_table would
have resulted in a crash. This function is only called during init.
It was raised in GitHub issue #1233.
It could be backported to all stable branches.
Two calloc calls were not checked in the srv_parse_source function.
Considering that this function could be called at runtime through a
dynamic server creation via the CLI, this could lead to an unfortunate
crash.
It was raised in GitHub issue #1233.
It could be backported to all stable branches even though the runtime
crash could only happen on branches where dynamic server creation is
possible.
The b_slow_realign() function may be used to realign a buffer with a given
amount of output data, possibly 0. In such cases, the head is set to
0. This function is not designed to be used with input only buffers, like
those used in the muxes. It is the purpose of b_slow_realign_ofs()
function. It does almost the same, realigning a buffer. But it does so by setting
the buffer head to a specific offset.
h1 parsing functions (h1_parse_msg_*) return the number of bytes parsed or
0 if nothing is parsed because an error occurred or some data are
missing. But they never return negative values. Thus, instead of a signed
integer, these functions now return a size_t value.
The H1 and FCGI muxes are updated accordingly. Note that h1_parse_msg_data()
has been slightly adapted because the parsing of chunked messages still needs
to handle negative values when a parsing error is reported by
h1_parse_chunk_size() or h1_skip_chunk_crlf().
The output of "show map/acl" now contains the 'entry_cnt' value that
represents the count of all the entries for each map/acl, not just the
active ones, which means that it also includes entries currently being
added.
The first item inserted into an ebtree will be inserted directly below
the root, which is a simple struct eb_root which only holds two branch
pointers (left and right).
If we try to find a duplicated entry to this first leaf through a
ebmb_next_dup, our leaf_p pointer will point to the eb_root instead of a
complete eb_node so we cannot look for the bit part of our leaf_p since
it would try to cast our eb_root into an eb_node and perform an out of
bounds access when reading "eb_root_to_node(eb_untag(t, EB_LEFT))->bit".
This bug was found by address sanitizer running on a CRL hot update VTC
test.
Note that the bug has been there since the import of the eb_next_dup()
and eb_prev_dup() function in 1.5-dev19 by commit 2b5702030 ("MINOR:
ebtree: add new eb_next_dup/eb_prev_dup() functions to visit duplicates").
It can be backported to all stable branches.
The following functions used in CA/CRL file hot update were not defined
in OpenSSL 1.0.2 so they need to be defined in openssl-compat:
- X509_CRL_get_signature_nid
- X509_CRL_get0_lastUpdate
- X509_CRL_get0_nextUpdate
- X509_REVOKED_get0_serialNumber
- X509_REVOKED_get0_revocationDate
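As an illustration, the fallbacks can be as simple as mapping them to
their pre-1.1.0 equivalents (a hedged sketch, only two of them shown):

    #if (OPENSSL_VERSION_NUMBER < 0x10100000L)
    #define X509_CRL_get0_lastUpdate(crl)  X509_CRL_get_lastUpdate(crl)
    #define X509_CRL_get0_nextUpdate(crl)  X509_CRL_get_nextUpdate(crl)
    #endif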
This patch adds the "set ssl crl-file" and "commit ssl crl-file"
commands, following the same logic as the certificate and CA file update
equivalents.
When trying to update a Certificate Revocation List (CRL) file via a
"set" command, we start by looking for the entry in the CA file tree and
then building a new cafile_entry out of the payload, without adding it
to the tree yet. It will only be added when a "commit" command is
called.
During a "commit" command, we insert the newly built cafile_entry in the
CA file tree while keeping the previous entry. We then iterate over all
the instances that used the CRL file and rebuild a new one and its
dedicated SSL context for every one of them.
When all the contexts are properly created, the old instances get
replaced by the new ones and the old CRL file is removed from the tree.
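A hedged CLI example ("site1.crl" is a placeholder name; the payload is
passed with the CLI's multi-line "<<" syntax and ends with an empty
line):

    set ssl crl-file site1.crl <<
    -----BEGIN X509 CRL-----
    ...
    -----END X509 CRL-----

    commit ssl crl-file site1.crl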
The CA files and CRL files are stored in the same cafile_tree so this
patch adds a new field to the cafile_entry structure that specifies the
type of the entry. Since a ca-file can also have some CRL sections, the
type will be based on the option used to load the file and not on its
content (ca-file vs crl-file options).
This patch adds the "set ssl ca-file" and "commit ssl ca-file" commands,
following the same logic as the certificate update equivalents.
When trying to update a ca-file entry via a "set" command, we start by
looking for the entry in the cafile_tree and then building a new
cafile_entry out of the given payload. This new object is not added to
the cafile_tree until "commit" is called.
During a "commit" command, we insert the newly built cafile_entry in the
cafile_tree, while keeping the previous entry as well. We then iterate
over all the instances linked in the old cafile_entry and rebuild a new
ckch instance for every one of them. The newly inserted cafile_entry is
used for all those new instances and their respective SSL contexts.
When all the contexts are properly created, the old instances get
replaced by the new ones and the old cafile_entry is removed from the
tree.
This fixes a subpart of GitHub issue #1057.
Adds a way to insert a new uncommitted cafile_entry in the tree. This
entry will be the one fetched by any lookup in the tree unless the
oldest cafile_entry is explicitly looked for. This way, until a "commit
ssl ca-file" command is completed, there could be two cafile_entries
with the same path in the tree, the original one and the newly updated
one.
The updated CA content coming from the CLI during a ca-file update will
directly be in memory and not on disk so the way CAs are loaded in a
cafile_entry for now (via X509_STORE_load_locations calls) cannot be
used.
This patch adds a way to fill a cafile_entry directly from memory and to
load the contained certificate and CRL sections into an SSL store.
CRL sections are managed as well as certificates in order to mimic the
way CA files are processed when specified in an option. Indeed, when
parsing a CA file given through a ca-file or ca-verify-file option, we
iterate over the different sections in ssl_set_cert_crl_file and load
them regardless of their type. This ensures that a file that was
properly parsed when given as an option will also be accepted by the
CLI.
In order for the link between the cafile_entry and the default ckch
instance to be built, we need to give a pointer to the instance during
the ssl_sock_prepare_ctx call.
Each ca-file entry of the tree will now hold a list of the ckch
instances that use it so that we can iterate over them when updating the
ca-file via a cli command. Since the link between the SSL contexts and
the CA file tree entries is only built during the ssl_sock_prepare_ctx
function, which is called after all the ckch instances are created, we
need to add a little post processing after each ssl_sock_prepare_ctx
that builds the link between the corresponding ckch instance and CA file
tree entries.
In order to manage the ca-file and ca-verify-file options, any ckch
instance can be linked to multiple CA file tree entries and any CA file
entry can link multiple ckch instances. This is done thanks to a
dedicated list of ckch_inst references stored in the CA file tree
entries over which we can iterate (during an update for instance). We
avoid having one of those instances go stale by keeping a list of
references to those references in the instances.
When deleting a ckch_inst, we can then remove all the ckch_inst_link
instances that reference it, and when deleting a cafile_entry, we
iterate over the list of ckch_inst references and clear the corresponding
entry in their own list of ckch_inst_link references.
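As a rough sketch of the double linkage described above (field names are
illustrative, the real structures may differ slightly):
  struct ckch_inst_link {
      struct ckch_inst *ckch_inst; /* instance using the cafile_entry */
      struct list list;            /* attached to the cafile_entry's list */
  };
  struct ckch_inst_link_ref {
      struct ckch_inst_link *link; /* back-reference to the link above */
      struct list list;            /* attached to the ckch_inst's list */
  };
Each cafile_entry then carries a list head of ckch_inst_link entries, and each
ckch_inst carries a list head of ckch_inst_link_ref entries, so that either
side can be deleted while cleanly detaching from the other.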
This patch moves all the ssl_store related code to ssl_ckch.c since it
will mostly be used there once the CA file update CLI commands are all
implemented. It also makes the cafile_entry structure visible as well as
the cafile_tree.
stdint.h is not as portable as inttypes.h. It doesn't exist at least
on AIX 5.1 and Solaris 7, while inttypes.h is present there and does
include stdint.h on platforms supporting it.
This is equivalent to libslz upstream commit e36710a ("slz: use
inttypes.h instead of stdint.h")
On ARM with native CRC support, no need to inflate the executable with
a 4kB CRC table, let's just drop it.
This is slz upstream commit d8715db20b2968d1f3012a734021c0978758f911.
This function was not used anymore after the atomic updates were
implemented in 2.3, and it must not be used given that it does not
yield and can easily make the process hang for tens of seconds on
large acls/maps. Let's remove it before someone uses it as an
example to implement something else!
It's still very difficult to find all commands starting with a given
keyword like "set", "show" etc. Let's sort the lines by usage message,
this is much more convenient.
The build fails on versions older than 1.0.1d which is the first one
introducing CRYPTO_memcmp(), so let's have a define for this instead
of enabling it whenever USE_OPENSSL is set. One could also wonder why
we're relying on openssl for such a trivial thing, and a simple local
implementation could also restore lexicographic ordering.
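For reference, a local implementation could be as simple as the sketch below,
which always scans the whole area (so the duration does not depend on where the
first difference lies) while still returning a lexicographic ordering; this is
only an illustration, not the code used by haproxy:
  static int local_secure_memcmp(const void *a, const void *b, size_t n)
  {
      const unsigned char *pa = a, *pb = b;
      int diff = 0;
      size_t i;

      for (i = 0; i < n; i++) {
          int d = (int)pa[i] - (int)pb[i];

          if (diff == 0)
              diff = d; /* keep the first difference only */
      }
      return diff;
  }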
Some of the Lua doc and a few places still used "Haproxy" or "HAproxy".
There was even one "HA proxy". A few of them were in an example of VTest
output, indicating that VTest ought to be fixed as well. No big deal but
better address all the remaining ones so that these inconsistencies stop
spreading around.
listener-t comes with openssl just due to the SSL_CTX type that is
declared as a typedef in openssl, hence cannot be abstracted at this
level. However connection-t.h doesn't need all that just to know that
bind_conf is a struct. Let's declare it with other external types
instead.
These ones are used by virtually every config parser. Not only do they
provide no benefit from being inlined, but they also imply a very deep
dependency starting at proxy.h, which results for example in task.c
including openssl.
Let's move these two functions to cfgparse.c.
These caddr_* functions were once placed into tools.h in the hope they
would be useful but nobody knows they exist. They could deserve being
moved to their own file with other pointer manipulation functions maybe,
but for now they're the only reason left for stick_table.h to include
tools.h, so let's move them directly there since it's its only user.
This allows to remove tools.h from stick_table.h and slightly reduce
the overall build time.
This function has no business being inlined in stick_table.h since it's
only used at boot time by the config parser. In addition it causes an
undesired dependency on tools.h because it uses parse_time_err(). Let's
move it to stick_table.c.
No idea why this was put inlined into connection.h, it's used only once
for haproxy -vv, and requires tools.h, causing an undesired dependency
from connection.h. Let's move it to connection.c instead where it ought
to have been.
Only mworker uses proc_self, and it was declared in global.h, forcing
users of global.h to include mworker and its dependencies.
Moving it to mworker reduces the preprocessed size of version.c from
170 to 125kB by shrinking the number of local includes from 30 to 16
and the number of system includes from 147 to 132.
The presence of this field causes a long dependency chain because almost
everyone includes global-t.h, and vars include sample_data which includes
some system includes as well as HTTP parts.
There is absolutely no reason for having the process-wide variables in
the global struct, let's just move them into vars.c and vars.h. This
reduces from ~190k to ~170k the preprocessed output of version.c.
This will allow some fields to be produced with a higher accuracy when
the requester indicates being able to parse floats. Rates and times are
among the elements which can make sense.
Currently the stats filling function knows nothing about the caller's
needs, so let's pass the STAT_* flags so that it can adapt to the
requester's constraints.
For stats reporting it can be convenient to report floats at low rates
instead of discrete integers. We do have quite some precision since we
currently divide counters by number of milliseconds, so we can usually
add 3 digits after the decimal point.
We already had ultoa_r() and friends but nothing to emit inline floats.
This is now done with ftoa_r() and F2A/F2H. Note that the latter both use
the itoa_str[] as temporary storage and that the HTML format currently is
the exact same as the ASCII one. The trailing zeroes are always trimmed so
these outputs are usable in user-visible output.
When using "%f" to print a float, it automatically gets 6 digits after
the decimal point and there's no way to automatically adjust to the
required ones by dropping trailing zeroes. This function does exactly
this and automatically drops the decimal point if all digits after it
were zeroes. This will make numbers more friendly in stats and makes
outputs shorter (e.g. JSON where everything is just a "number").
The function is designed to be easy to use with snprintf() and chunks:
snprintf:
flt_trim(buf, 0, snprintf(buf, sizeof(buf), "%f", x));
chunk_printf:
out->data = flt_trim(out->area, 0, chunk_printf(out, "%f", x));
chunk_appendf:
size_t prev_data = out->data;
out->data = flt_trim(out->area, prev_data, chunk_appendf(out, "%f", x));
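For the record, the trimming logic essentially boils down to the minimal sketch
below (illustrative only, the real flt_trim() may differ in details); prev_data
is the offset at which the float was just printed and the returned value is the
new total length:
  size_t flt_trim(char *buffer, size_t prev_data, size_t len)
  {
      char *p = buffer + prev_data;
      char *end = buffer + len;
      char *dot, *last;

      /* look for the decimal point; nothing to trim without one */
      for (dot = p; dot < end; dot++)
          if (*dot == '.')
              break;
      if (dot == end)
          return len;

      /* drop trailing zeroes, and the '.' itself if nothing remains after it */
      for (last = end - 1; last > dot && *last == '0'; last--)
          ;
      if (last == dot)
          last--;
      return (last + 1) - buffer;
  }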
Add a new proxy capability for proxies with load-balancing capabilities.
This helps to differentiate listen/frontend/backend from special proxies
such as peer proxies.
Support experimental actions. It is mandatory to use
'expose-experimental-directives' first to be able to use them.
If such an action is present in the config file, the tainted status of the
process is updated. Another tainted status is set when an experimental
action is executed.
Define a new keyword flag KWF_MATCH_PREFIX. This is used to replace the
match_pfx field of the action struct.
This has the benefit of making action declarations more explicit, and now
it is possible to quickly implement experimental actions.
Add a new flag to mark a keyword as experimental. An experimental
keyword cannot be used if the global 'expose-experimental-directives' is
not present first.
Only keywords parsed through the standard cfg_keywords lists in the
global/proxies sections will be automatically detected if declared
experimental. To support a keyword outside of these lists,
check_kw_experimental must be called manually during its parsing.
If an experimental keyword is present in the config, the tainted flag is
updated.
For the moment, no keyword is marked as experimental.
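For keywords handled outside these lists, the manual check essentially amounts
to the sketch below (the flag, variable and taint names are illustrative; the
real check_kw_experimental() helper may take different arguments):
  if (kw->flags & KWF_EXPERIMENTAL) {
      if (!experimental_directives_allowed) {
          memprintf(err, "'%s' is experimental, it requires "
                    "'expose-experimental-directives' in the global section", kw->kw);
          return -1;
      }
      mark_tainted(TAINTED_CONFIG_EXP_KW_DECLARED);
  }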
This field will be used to add various mechanism to config parsing.
Currently no flag value is implemented. The following commit will
implement experimental keywords.
Add a global flag named 'tainted'. Its purpose is to report various
statuses about experimental features used during the current process's
lifetime.
By default it is initialized to 0. It can be set/retrieved by a couple of
new functions, mark_tainted()/get_tainted(). Once a flag is set, it
cannot be reset.
Currently, no tainted status is implemented, it will be the subject of
the following commits.
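A minimal sketch of what these primitives amount to (illustrative only, using
the existing atomic macros; the actual flag enum is defined separately):
  static unsigned int tainted = 0;

  static inline void mark_tainted(unsigned int flag)
  {
      HA_ATOMIC_OR(&tainted, flag); /* flags can only be set, never cleared */
  }

  static inline unsigned int get_tainted(void)
  {
      return HA_ATOMIC_LOAD(&tainted);
  }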
The new function split_version() converts a parsable haproxy version to
an array of integers. The function compare_current_version() compares an
arbitrary version to the current one. These two functions were written
by Thierry Fournier in 2013, and are still usable as-is. They will be
used to write config language predicates.
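The principle is simple enough to illustrate with a stripped-down sketch,
assuming a plain "X.Y.Z" version string (the real split_version() also copes
with -dev/-rc style suffixes):
  static int compare_versions(const char *v1, const char *v2)
  {
      unsigned int a[3] = { 0 }, b[3] = { 0 };
      int i;

      sscanf(v1, "%u.%u.%u", &a[0], &a[1], &a[2]);
      sscanf(v2, "%u.%u.%u", &b[0], &b[1], &b[2]);

      /* the first differing field decides the ordering */
      for (i = 0; i < 3; i++)
          if (a[i] != b[i])
              return a[i] < b[i] ? -1 : 1;
      return 0;
  }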
Till now it was only presented in the version output but could not be
consulted outside of haproxy.c, let's export it as a variable, and set
it to an empty string if not defined.
Let's add a few fields to the global struct to store information about
the current file being processed, the current line number and the current
section. This will be used to retrieve them using special variables.
When the abortonclose option is enabled, to be sure to be immediately
notified when a shutdown is received from the client, the frontend
conn-stream must be sure the mux will wait for read events. To do so, the
CO_RFL_KEEP_RECV flag is set when mux->rcv_buf() is called. This new flag
instructs the mux to wait for read events, regardless of its internal state.
This patch is required to fix abortonclose option for H1 client connections.
When channel_may_recv() is called for an HTX stream, the HTX version,
channel_htx_may_recv() is called. This patch is mandatory to fix a bug
related to the abortonclose option.
This adds the necessary flags to permit run-time enabling/disabling of
memory profiling. For now this is disabled.
A few words were added to the management doc about it and recalling that
this is limited to certain OSes.
get_sym_curr_addr() will return the address of the first occurrence of
the given symbol while get_sym_next_addr() will return the address of
the next occurrence of the symbol. These ones return NULL on non-linux,
non-ELF, non-USE_DL.
Implement a safe mechanism to close front idling connections which
prevent the soft-stop from completing. Every h1/h2 front connection is
added in a new per-thread list instance. On shutdown, a new task is
woken up which calls the wake mux operation on every connection still
present in the new list.
A new stopping_list attach point has been added in the connection
structure. As this member is only used for frontend connections, it
shares the same union as the session_list reserved for backend
connections.
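In terms of layout, the result is roughly the following (sketch only, member
placement in the real struct connection differs):
  struct connection {
      ...
      union {
          struct list session_list;  /* only valid for backend connections */
          struct list stopping_list; /* only valid for frontend connections */
      };
      ...
  };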
Move the session_list attach point in an anonymous union. This member is
only used for backend connections. This commit is in preparation for the
support of stopping frontend idling connections which will add another
member to the union.
This change means that special care must be taken to be sure that only
backend connections manipulate the session_list. A few BUG_ONs have been
added as special guards to prevent misuse.
The purpose of this debugging option was to prevent certain pools from
masking other ones when they were shared. For example, task, http_txn,
h2s, h1s, h1c, session, fcgi_strm, and connection are all 192 bytes and
would normally be merged, but not with this option. The problem is that
certain pools are declared multiple times with various parameters, which
are often very close, and due to the way the option works, they're not
shared either. Good examples of this are captures and stick tables. Some
configurations have large numbers of stick-tables of pretty similar types
and it's very common to end up with the following when the option is
enabled:
$ socat - /tmp/sock1 <<< "show pools" | grep stick
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753800=56
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753880=57
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753900=58
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753980=59
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a00=60
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753a80=61
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753b00=62
- Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x753780=55
In addition to not being convenient, it can have important effects on the
memory usage because these pools will not share their entries, so one stick
table cannot allocate from another one's pool.
This patch solves this by going back to the initial goal which was not to
have different pools in the same list. Instead of masking the MEM_F_SHARED
flag, it simply adds a test on the pool's name, and disables pool sharing
if the names differ. This way pools are not shared unless they're of the
same name and size, which doesn't hinder debugging. The same test above
now returns this:
$ socat - /tmp/sock1 <<< "show pools" | grep stick
- Pool sticktables (160 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 7 users, @0x3fadb30 [SHARED]
- Pool sticktables (224 bytes) : 0 allocated (0 bytes), 0 used, needed_avg 0, 0 failures, 1 users, @0x3facaa0 [SHARED]
This is much better. This should probably be backported, in order to limit
the side effects of DEBUG_DONT_SHARE_POOLS being enabled in production.
DragonflyBSD already has an attribute __read_mostly which serves the
same purpose as the one in compiler.h.
No need to be backported as it was added in the current 2.4-dev.
Instead of being able to purge only values older than a specific value,
let's support arbitrary ranges and make pat_ref_purge_older() just be
one special case of this one.
Thanks to the commit "BUG/MINOR: applet: Notify the other side if data were
consumed by an applet", it is no longer necessary to notify the producer when an
applet skips output data. Now, it is the default applet handler's
responsibility to take care of that.
Move cpu_map structure outside of the global struct to a global
variable defined in the cpuset.c compilation unit. This allows reorganizing
the includes without having to define _GNU_SOURCE everywhere for the
support of the cpu_set_t.
This fixes the compilation with musl libc, most notably used for the
alpine based docker image.
This fixes the github issue #1235.
No need to backport as this feature is new in the current
2.4-dev.
The HTX functions used to add new HTX blocks in a message have been moved to
the header file to inline them in calling functions. These functions are
small enough.
A normalized URI is the internal term used to specify that a URI is stored
using the absolute format (scheme + authority + path). For now, it is only used
for H2 clients. It is the default and recommended format for H2 requests.
However, it is unusual for H1 servers to receive such a URI. So in this case,
we only send the path of the absolute URI. It is performed for H1 servers,
but not for FCGI applications. This patch fixes the difference.
Note that it is not a real bug, because FCGI applications should support
absolute URIs.
Note also that a normalized URI is only detected for H2 clients when a request
is received. There is no such test on the H1 side. It means an absolute URI
received from an H1 client will be sent without modification to an H1 server
or an FCGI application.
To make it possible, a dedicated function has been added to get the H1
URI. This function is called by the H1 and the FCGI multiplexer when a
request is sent to a server.
This patch should fix the issue #1232. It must be backported as far as 2.2.
In proxy.c, when the process is stopping, we try to flush table contents
using 'stktable_trash_oldest'. A check on a counter "table->syncing" was
made to verify if there is no pending resync in progress.
But with multiple threads this counter can be increased by another thread
only after some delay, so the content of some tables can be trashed earlier and
won't be pushed to the new process (after reload, some tables appear reset and
others don't).
This patch re-names the counter "table->syncing" to "table->refcnt" and
the counter is increased during configuration parsing (registering a table to
a peer section) to protect tables during runtime and until resync of a new
process has succeeded or failed.
The inc/dec operations are now made using atomic operations
because multiple peer sections could refer to the same table in the future.
This fix addresses github #1216.
This patch should be backported to all branches with multi-thread support (v >= 1.8).
The compiler cannot guess that tv_sec or tv_usec might have unused
parts, so the multiply by 1000 and the divide by 1000 are both
performed using 64-bit constants to stick to the common type. This is
not needed since we only keep the final 32 bits, let's help the compiler
here by casting these fields to uint. The tv_update_date() code is much
cleaner (48 bytes smaller in the CAS loop) as it avoids some register
spilling at a location where that's really unwanted.
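As an illustration, the millisecond computation goes from something like the
first form to the second one below (the exact expression in tv_update_date()
may differ slightly):
  /* before: 64-bit multiply and divide because tv_sec/tv_usec are 64-bit */
  new_ms = tv.tv_sec * 1000 + tv.tv_usec / 1000;

  /* after: only the low 32 bits matter for the wrapping ms counter */
  new_ms = (uint)tv.tv_sec * 1000 + (uint)tv.tv_usec / 1000;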
The compilation is currently broken on platforms without USE_CPU_AFFINITY
set. An error has been reported by the cygwin build of the CI.
This does not need to be backported.
In file included from include/haproxy/global-t.h:27,
from include/haproxy/global.h:26,
from include/haproxy/fd.h:33,
from src/ev_poll.c:22:
include/haproxy/cpuset-t.h:32:3: error: #error "No cpuset support implemented on this platform"
32 | # error "No cpuset support implemented on this platform"
| ^~~~~
include/haproxy/cpuset-t.h:37:2: error: unknown type name ‘CPUSET_REPR’
37 | CPUSET_REPR cpuset;
| ^~~~~~~~~~~
make: *** [Makefile:944: src/ev_poll.o] Error 1
make: *** Waiting for unfinished jobs....
In file included from include/haproxy/global-t.h:27,
from include/haproxy/global.h:26,
from include/haproxy/fd.h:33,
from include/haproxy/connection.h:30,
from include/haproxy/ssl_sock.h:27,
from src/ssl_sample.c:30:
include/haproxy/cpuset-t.h:32:3: error: #error "No cpuset support implemented on this platform"
32 | # error "No cpuset support implemented on this platform"
| ^~~~~
include/haproxy/cpuset-t.h:37:2: error: unknown type name ‘CPUSET_REPR’
37 | CPUSET_REPR cpuset;
| ^~~~~~~~~~~
make: *** [Makefile:944: src/ssl_sample.o] Error 1
Render numa detection optional with a global configuration statement
'no numa-cpu-mapping'. This can be used if the applied affinity of the
algorithm is not optimal. Also complete the documentation with this new
keyword.
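For reference, disabling the automatic binding only requires the new keyword in
the global section:
  global
      no numa-cpu-mapping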
On process startup, the CPU topology of the machine is inspected. If a
multi-socket CPU machine is detected, automatically define the process
affinity on the first node with active cpus. This is done to prevent an
impact on the overall performance of the process in case the topology of
the machine is unknown to the user.
This step is not executed in the following conditions:
- a non-null nbthread statement is present
- a restrictive 'cpu-map' statement is present
- the process affinity is already restricted, for example via a taskset
call
For the record, benchmarks were executed on a machine with 2 CPUs
Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz. In both clear and ssl
scenarios, the performance was sub-optimal without the automatic
rebinding on a single node.
Allow specifying multiple cpu ids/ranges in parse_cpu_set, separated by a
comma. This is optional and must be activated by a parameter.
The comma support is disabled for the parsing of the 'cpu-map' config
statement. However, it will be useful to parse files in sysfs when
inspecting the cpus topology for NUMA automatic process binding.
Create a function thread_cpu_mask_forced. Its purpose is to report if a
restrictive cpu mask is active for the current process, for example due
to a taskset invocation. It is only implemented for the linux platform
currently.
Use the platform-independent type hap_cpuset for the cpu-map statement
parsing. This allows addressing CPU indexes greater than LONGBITS.
Update the documentation to reflect the removal of this limit except for
platforms without cpu_set_t type or equivalent.
Replace the unsigned long parameter by a hap_cpuset. This allows
addressing CPUs with an index greater than LONGBITS.
This function is used to parse the 'cpu-map' statement. However at the
moment, the result is cast back to a long to store it in the global
structure. The next step is to replace the ulong in cpu_map in the
global structure with hap_cpuset.
This module can be used to manipulate cpu sets in a platform-agnostic
way. Use the type cpu_set_t/cpuset_t if available on the platform, or
fall back to an unsigned long, which de facto limits the maximum cpu
index to LONGBITS.
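A rough sketch of the abstraction (the platform detection macros below are
illustrative; the real header performs its own feature detection):
  #if defined(__linux__)
  # define CPUSET_REPR cpu_set_t
  #elif defined(__FreeBSD__) || defined(__NetBSD__)
  # define CPUSET_REPR cpuset_t
  #else
  # define CPUSET_REPR unsigned long /* caps usable cpu indexes at LONGBITS */
  #endif

  struct hap_cpuset {
      CPUSET_REPR cpuset;
  };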
As we now embed the library we don't need to support the older 1.0 API
any more, so we can remove the explicit calls to slz_make_crc_table()
and slz_prepare_dist_table().
Now that SLZ is merged, let's update the makefile and compression
files to use it. As a result, SLZ_INC and SLZ_LIB are neither defined
nor used anymore.
USE_SLZ is enabled by default ("USE_SLZ=default") and can be disabled
by passing "USE_SLZ=" or by enabling USE_ZLIB=1.
The doc was updated to reflect the changes.
SLZ is rarely packaged by distros and there have been complaints about
the CPU and memory usage of ZLIB, leading to some suggestions to better
address the issue by simply integrating SLZ into the tree (just 3 files).
See discussions below:
https://www.mail-archive.com/haproxy@formilux.org/msg38037.html
https://www.mail-archive.com/haproxy@formilux.org/msg40079.html
https://www.mail-archive.com/haproxy@formilux.org/msg40365.html
This patch does just this, after minor adjustments to these files:
- tables.h was renamed to slz-tables.h
- tables.h had the precomputed tables removed since not used here
- slz.c uses includes <import/slz*> instead of "slz*.h"
The slz commit imported here was b06c172 ("slz: avoid a build warning
with -Wimplicit-fallthrough"). No other change was performed either to
SLZ nor to haproxy at this point so that this operation may be replicated
if needed for a future version.
This normalizer removes "/./" segments from the path component.
Usually the dot refers to the current directory which renders those segments redundant.
See GitHub Issue #714.
The current "ADD" vs "ADDQ" is confusing because when thinking in terms
of appending at the end of a list, "ADD" naturally comes to mind, but
here it does the opposite, it inserts. Several times already it's been
incorrectly used where ADDQ was expected, the latest of which was a
fortunate accident explained in 6fa922562 ("CLEANUP: stream: explain
why we queue the stream at the head of the server list").
Let's use more explicit (but slightly longer) names now:
LIST_ADD -> LIST_INSERT
LIST_ADDQ -> LIST_APPEND
LIST_ADDED -> LIST_INLIST
LIST_DEL -> LIST_DELETE
The same is true for MT_LISTs, including their "TRY" variant.
LIST_DEL_INIT keeps its short name to encourage to use it instead of the
lazier LIST_DELETE which is often less safe.
The change is large (~674 non-comment entries) but is mechanical enough
to remain safe. No permutation was performed, so any out-of-tree code
can easily map older names to new ones.
The list doc was updated.
Amaury noticed that I managed to break the build of DEBUG_FAIL_ALLOC
for the second time with 207c09509 ("MINOR: pools: move the fault
injector to __pool_alloc()"). The joy of endlessly reworking patch
sets... No backport is needed, that was in the just merged cleanup
series.
By having a pair of dummy pool_get_from_cache() and pool_put_to_cache()
we can remove some ugly ifdefs, so let's do this. We've already done it
for the shared cache.
This function has become too big (251 bytes) and is now hurting
performance a lot, with up to 4% request rate being lost over the last
pool changes. Let's move it to pool.c as a regular function. Other
attempts were made to cut it in half but it's still inefficient. Doing
this results in saving ~90kB of object code, and even 112kB since the
pool changes, with code that is even slightly faster!
Conversely, pool_get_from_cache(), which remains half of this size, is
still faster inlined, likely in part due to the immediate use of the
returned pointer afterwards.
Since pool_get_from_cache() and pool_put_to_cache() were now only wrappers
to the local cache versions which do all the job, let's merge them together
so that there is no more local-cache specific function.
Now when pool_get_from_local_cache() fails, it automatically falls back
to pool_get_from_shared_cache(), which used to always be done in
pool_get_from_cache(). Thus now the API is simpler as we always allocate
and free from/to the local caches.
Till now it used to call it only if there were not too many objects in
the local cache, otherwise it would send the latest one directly into the
shared cache. Now it always sends to the local cache and it's up to the
local cache to free its oldest objects. From a cache freshness perspective
it's better this way since we always evict cold objects instead of hot
ones. From an API perspective it's better because it will help make the
shared cache invisible to the public API.
These two functions are now responsible for allocating directly from
the cache and releasing to the cache.
Now the pool_alloc() function simply does this:
if cache enabled
return pool_alloc_from_cache() if not NULL
return pool_alloc_nocache() otherwise
and the pool_free() function does this:
if cache enabled
pool_put_to_cache()
else
pool_free_nocache()
For now this only introduces these two functions without changing anything
else, but the goal is to soon make them implementation-specific.
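In C terms this dispatch is roughly the following (sketch only, assuming a
CONFIG_HAP_POOLS-style build switch; the real functions take more arguments and
flags):
  void *pool_alloc(struct pool_head *pool)
  {
      void *p = NULL;

  #ifdef CONFIG_HAP_POOLS
      p = pool_alloc_from_cache(pool); /* may return NULL on a cache miss */
  #endif
      if (!p)
          p = pool_alloc_nocache(pool); /* direct allocation */
      return p;
  }

  void pool_free(struct pool_head *pool, void *ptr)
  {
  #ifdef CONFIG_HAP_POOLS
      pool_put_to_cache(pool, ptr);
  #else
      pool_free_nocache(pool, ptr);
  #endif
  }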
Continuing the unification of local and shared pools, now the usage of
pools is governed by CONFIG_HAP_POOLS without which allocations and
releases are performed directly from the OS using pool_alloc_nocache()
and pool_free_nocache().
There are two levels of freeing to the OS:
- code that wants to keep the pool's usage counters updated uses
pool_free_area() and handles the counters itself. That's what
pool_put_to_shared_cache() does in the no-global-pools case.
- code that does not want to update the counters because they were
already updated only calls pool_free_area().
Let's extract these calls to establish the symmetry with pool_get_from_os()
and pool_alloc_nocache(), resulting in pool_put_to_os() (which only updates
the allocated counter) and pool_free_nocache() (which also updates the used
counter). This will later allow simplifying the generic code.
Calling pool_free_area() inside a lock in pool_put_to_shared_cache() is
a very bad idea. Fortunately this only happens on the lowest-end platforms
which almost never use threads, or use them in very small counts.
This change consists in zeroing the pointer once already released to the
cache in the first test so that the second stage knows if it needs to
pass it to the OS or not. This has slightly reduced the length of the
A part of the code cannot be factored out because it still uses non-atomic
inc/dec for pool->used and pool->allocated as these are located under the
pool's lock. While it can make sense in terms of bus cycles, it does not
make sense in terms of code normalization. Further, some operations were
still performed under a lock that could be totally removed via the use of
atomic ops.
There is still one occurrence in pool_put_to_shared_cache() in the locked
code where pool_free_area() is called under the lock, which must absolutely
be fixed.
Now there's one part dealing with the allocation itself and keeping
counters up to date, and another one on top of it to return such an
allocated pointer to the user and update the use count and stats.
This is in anticipation for being able to group cache-related parts.
The release code is still done at once.
Till now it was limited to objects allocated from the OS which means
it had little use as soon as pools were enabled. Let's move it upper
in the layers so that any code can benefit from fault injection. In
addition this allows passing a new flag POOL_F_NO_FAIL to disable it
if some callers prefer a no-failure approach.
Now the multi-level cache becomes more visible:
pool_get_from_local_cache()
pool_put_to_local_cache()
pool_get_from_shared_cache()
pool_put_to_shared_cache()
The functions were rightfully called from/to_cache when the thread-local
cache was considered as the only cache, but this is getting terribly
confusing. Let's call them from/to local_cache to make it clear that
it is not related to the shared cache.
As a side note, since pool_evict_from_cache() used not to work for a
particular pool but for all of them at once, it was renamed to
pool_evict_from_local_caches() (plural form).
This is exactly what it is, the entry is retrieved from the shared
cache when it is defined. The implementation that is enabled with
CONFIG_HAP_NO_GLOBAL_POOLS continues to return NULL.
Now that __pool_alloc() only surrounds __pool_get_first() with the lock,
let's move it to the only variant that requires it and remove the ugly
ifdefs from the function. This is safe because nobody else calls this
function.
In __pool_alloc(), historically we used to factor out the
pool's lock between __pool_get_first() and __pool_refill_alloc(),
resulting in real malloc() or mmap() calls being performed under
the pool lock (for platforms using the locked shared pools).
As this is not needed anymore, let's move the call out of the
lock, it may improve allocation patterns on some platforms. This
also makes __pool_alloc() cleaner as we see a first attempt to
allocate from the local cache, then a second from the shared
cache, then a real allocation.
They were strictly equivalent, let's remerge them and rename them to
pool_alloc_nocache() as it's the call which performs a real allocation
which does not check nor update the cache. The only difference in the
past was the former taking the lock and not the latter, but now the lock
is not needed anymore at this stage since the pool's list is not touched.
In addition, given that the "avail" argument is no longer used by the
function nor by its callers, let's drop it.
This is a first step towards unifying all the fallback code. Right now
these two functions are the only ones which do not update the needed_avg
rate counter since there's currently no shared pool kept when using them.
But their code is similar to what could be used everywhere except for
this one, so let's make them capable of maintaining usage statistics.
As a side effect the needed field in "show pools" will now be populated.
The mem_should_fail() call enabled by DEBUG_FAIL_ALLOC used to be placed
only in the no-cache version of the allocator. Now we can generalize it
to all modes and remove the exclusive test on CONFIG_HAP_NO_GLOBAL_POOLS.
We're going to make the local pool always present unless pools are
completely disabled. This means that pools are always enabled by
default, regardless of the use of threads. Let's drop this notion
of "local" pools and make it just "pool". The equivalent debug
option becomes DEBUG_NO_POOLS instead of DEBUG_NO_LOCAL_POOLS.
For now this changes nothing except the option and dropping the
dependency on USE_THREAD.
Initially per-thread pool caches were stored into a fixed-size array.
But this was a bit ugly because the last allocated pools were not able
to benefit from the cache at all. As a workaround to preserve
performance, a size of 64 cacheable pools was set by default (there
are 51 pools at the moment, excluding any addon and debugging code),
so all in-tree pools were covered, at the expense of higher memory
usage.
In addition an index had to be calculated for each pool, and was used
to access the pool cache head in that array. The pool index was not
even stored into the pools so it was required to determine it to access
the cache when the pool was already known.
This patch changes this by moving the pool cache head into the pool
head itself. This way it is certain that each pool will have its own
cache. This removes the need for index calculation.
The pool cache head is 32 bytes long so it was aligned to 64B to avoid
false sharing between threads. The extra cost is not huge (~2kB more
per pool than before), and we'll make better use of that space soon.
The pool cache head contains the size, which should probably be removed
since it's already in the pool's head.
In commit fb117e6a8 ("MEDIUM: memory: don't let pool_put_to_cache() free
the objects itself") pool_evict_from_cache() was introduced with no
argument, yet the only call place passes it the pool, the pointer and
the index number!
Let's remove these as they even let the reader think that the function
does something specific to the current pool while it's not the case.
This patch renames all existing uri-normalizers into a more consistent naming
scheme:
1. The part of the URI that is being touched.
2. The modification being performed as an explicit verb.
This normalizer merges `../` path segments with the preceding segment, removing
both the preceding segment and the `../`.
Empty segments do not receive special treatment. The `merge-slashes` normalizer
should be executed first.
See GitHub Issue #714.
When a thread ends its harmless period, we must only consider running
threads when testing the threads_want_rdv_mask mask. To do so, we reintroduce
all_threads_mask mask in the bitwise operation (It was removed to fix a
deadlock).
Note that for now it is useless because there is no way to stop threads or
to have threads reserved for another task. But it is safer this way to avoid
bugs in the future.
A previous patch was pushed to fix a deadlock when an isolated thread ends
its harmless period (a9a9e9aac ["BUG/MEDIUM: thread: Fix a deadlock if an
isolated thread is marked as harmless"]). But, unfortunately, the fix is
incomplete. The same must be done in the outer loop, in
thread_harmless_end() function. The current thread must be ignored when
threads_want_rdv_mask mask is tested.
This patch must also be backported as far as 2.0.
This library is required for the subsequent patch which adds
the JSON query possibility.
It is necessary to change the include statement in "src/mjson.c"
because the imported includes in haproxy are in "include/import"
orig: #include "mjson.h"
new: #include <import/mjson.h>
ub64dec and ub64enc are the base64url equivalent of b64dec and base64
converters. base64url encoding is the "URL and Filename Safe Alphabet"
variant of base64 encoding. It is also used in the JWT (JSON Web Token)
standard.
The RFC1421 mention in the base64.c file is deprecated, so it was replaced
with RFC4648, to which the existing converters, base64/b64dec, still apply.
Example:
HAProxy:
http-request return content-type text/plain lf-string %[req.hdr(Authorization),word(2,.),ub64dec]
Client:
Token=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJ1c2VyIjoiZm9vIiwia2V5IjoiY2hhZTZBaFhhaTZlIn0.5VsVj7mdxVvo1wP5c0dVHnr-S_khnIdFkThqvwukmdg
$ curl -H "Authorization: Bearer ${TOKEN}" http://haproxy.local
{"user":"foo","key":"chae6AhXai6e"}
Allocation errors are now handled in the bind_conf_alloc() functions. Thus
callers, when not already done, are also updated to catch NULL return values.
This patch may be backported (at least partially) to all stable
versions. However, it only fixes errors during configuration parsing. Thus it
is not mandatory.
Add the trace support for the checks. Only tcp-check based health-checks are
supported, including the agent-check.
In traces, the first argument is always a check object. So it is easy to get
all info related to the check: the tcp-check ruleset, the conn-stream and
the connection, the server state...
Olivier spotted that I messed up during a rebase of commit 92c059c2a
("MINOR: atomic: implement native BTS/BTR for x86"), losing the x86
version of the BTS/BTR and leaving the generic version for it instead
of having this block in the #else. Since this variant is not used for
now it was easy to overlook it. Let's re-implement it here.
The time initialization was made a bit complex because we rely on a
dummy negative argument to reset all fields, leaving no distinction
between process-level initialization and thread-level initialization.
This patch changes this by introducing two functions, one for the
process and the second one for the threads. This removes the ambiguous
test and makes sure that the relevant fields are always initialized
exactly once. This also offers a better solution to the bug fixed in
commit b48e7c001 ("BUG/MEDIUM: time: make sure to always initialize
the global tick") as there is no more special values for global_now_ms.
It's simple enough to be backported if any other time-related issues
are encountered in stable versions in the future.
It was only used by freq_ctr and is not used anymore. In addition the
local curr_sec_ms was removed, as well as the equivalent extern
definitions which did not exist anymore either.
update_freq_ctr_period() was still not very clean and didn't wait for
the rotation lock to be dropped before trying again, thus maintaining
the contention at a high level. In addition, the rotation update was
made in three steps, which are not very efficient in terms of bus
cycles.
Here the wait loop was reworked so that the fast path remains short
and that the contended path waits for the lock to be dropped before
attempting another write, but it only waits a relax cycle before
attempting a read. The rotation block was simplified to remove a
test that was already validated by the first loop, and so that the
retrieval of the current period, its reset and its increment are all
performed in a single atomic op and the store to the previous period
is performed immediately after.
All this results in significantly smaller code for the inline function
(~1kB total) and a shorter critical path.
When counters are rotated, there is contention between the threads which
can slow down the operation of the thread performing the rotation. Let's
apply a cpu_relax there to let the first thread finish faster.
It remains cumbersome to preserve two versions of the freq counters and
two different internal clocks just for this. In addition, the savings
from using two different mechanisms are not that important as the only
saving is a divide that is replaced by a multiply, but now thanks to
the freq_ctr_total() unification the code could also be simplified to
optimize it in case of constants.
This patch turns all non-period freq_ctr functions to static inlines
which call the period-based ones with a period of 1 second. A direct
benefit is that a single internal clock is now needed for any counter
and that they now all rely on ticks.
These 1-second counters are essentially used to report request rates
and to enforce a connection rate limitation in listeners. It was
verified that these continue to work like before.
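As an illustration of the wrapping principle (names as described above; the
exact conversion to ticks in the real inline functions may differ):
  static inline unsigned int read_freq_ctr(struct freq_ctr *ctr)
  {
      /* the 1-second API is now just the period-based one with a 1s period */
      return read_freq_ctr_period(ctr, MS_TO_TICKS(1000));
  }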
Both structures are identical except the name of the field starting
the period and its description. Let's call them all freq_ctr and the
period's start "curr_tick" which is generic.
This is only a temporary change and fields are expected to remain
the same with no code change (verified).
There was still no function to compute a wait time for periods, let's
implement it on top of freq_ctr_total() as we'll soon need it for the
per-second one. The divide here is applied on the frequency so that it
will be replaced with a reciprocal multiply when constant.
This one is the easiest to implement, it just requires a call and a
divide of the result. Anti-flapping correction for low-rates was
preserved.
Now calls using a constant period will be able to use a reciprocal
multiply for the period instead of a divide.