The purpose is to enable the dumping of a backtrace on BUG_ON(). While
it's very useful to know that a condition was met, very often some
caller context is missing to figure out how the condition could happen.
From now on, on systems featuring backtrace, a backtrace of the calling
thread will also be dumped to stderr in addition to the unexpected
condition. This will help users of DEBUG_STRICT as they'll most often
find this backtrace in their logs even if they can't find their core
file.
A new "debug dev bug" expert-mode CLI command was added to test the
feature.
This function calls the ha_dump_backtrace() function with a locally
allocated buffer and sends the output slightly indented to fd #2. It's
meant to be used as an emergency backtrace dump.
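For illustration, a generic sketch of such an emergency dump using
glibc's execinfo API (the function name is made up, this is not the
actual implementation):

    #include <execinfo.h>

    /* Emergency dump: capture the calling thread's stack into a local
     * array and write the symbolized frames straight to stderr (fd #2),
     * without allocating anything in a possibly corrupted process.
     */
    static void emergency_backtrace(void)
    {
        void *frames[64];
        int depth = backtrace(frames, 64);

        backtrace_symbols_fd(frames, depth, 2);
    }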
The backtrace dumping code was located in the thread dump function
but it looks particularly convenient to be able to call it to produce
a dump in other situations, so let's move it to its own function and
make sure it's called last in the function so that we can benefit from
tail merging to save one entry.
In order to simplify the code and remove annoying ifdefs everywhere,
let's always export my_backtrace() and make it adapt to the situation
and return zero if not supported. A small update in the thread dump
function was needed to make sure we don't use its results if it fails
now.
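A minimal sketch of the idea, assuming glibc's backtrace() as the
supported implementation (the guard macro name is illustrative):

    #ifdef USE_BACKTRACE
    #include <execinfo.h>
    #endif

    /* Always exported: returns the number of captured frames, or zero
     * when the platform has no backtrace support, so that callers need
     * no ifdef and simply skip the dump on zero.
     */
    static inline int my_backtrace(void **buffer, int max)
    {
    #ifdef USE_BACKTRACE
        return backtrace(buffer, max);
    #else
        return 0;
    #endif
    }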
use `stats_fill_fe_stats` when possible to avoid duplicating code; make
use of field selector to get the needed field only.
this should not introduce any difference in the output.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
prometheus approach requires outputting all values for a given metric
name; meaning we iterate through all metrics, and then iterate in the
inner loop on all objects for this metric.
In order to allow more code reuse, adapt the stats API to be able to
select one field or fill them all otherwise.
From this patch it should be possible to remove most of the duplicate
code on the prometheus side for the frontend.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
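A hedged sketch of what the selector-aware call may look like (the
exact signature is assumed):

    /* When <selected_field> is non-NULL, only this field is filled in
     * <stats>, otherwise all of them are. The prometheus exporter can
     * thus iterate per metric and request a single field per object.
     */
    int stats_fill_fe_stats(struct proxy *px, struct field *stats,
                            int len, enum stat_field *selected_field);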
Another patch in order to try to reconcile haproxy stats and
prometheus. Here I'm adding a proper start time field in order to make
proper use of uptime field.
That being done, we can move the calculation into `fill_info`.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
in order to prepare a possible merge of fields between haproxy stats and
prometheus, duplicate 3 fields:
INF_MEMMAX
INF_POOL_ALLOC
INF_POOL_USED
Those were specifically expressed in the MB unit, which is not what
prometheus recommends. We therefore used them but changed the unit while
doing the calculation, which created a specific case all the way up to
the description.
This patch:
- removes some possible confusion, i.e. using MB field for bytes
- will permit an easier merge of fields such as description
A first consequence, for now, is that we can remove the calculation on
the prometheus side and move it into `fill_info`.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
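For illustration, the duplication in fill_info() may look like the
lines below (the _BYTES field names and the flags are assumed, not
taken from the patch):

    info[INF_MEMMAX]       = mkf_u32(FO_CONFIG, global.rlimit_memmax); /* MB, legacy */
    info[INF_MEMMAX_BYTES] = mkf_u64(FO_CONFIG,
                                     global.rlimit_memmax * 1048576ULL); /* bytes */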
The MUX_ES_NOTIMPL_ERR exit status is added to allow the multiplexers to
report errors about not implemented features. This will be used by the H1
mux to return 501-not-implemented errors.
Add the support for the 501-not-implemented status code with the
corresponding default message. The documentation is updated accordingly
because it is now part of status codes HAProxy may emit via an errorfile or
a deny/return HTTP action.
When a tcpcheck ruleset uses multiple connections, the existing one must be
closed and destroyed before opening the new one. This part is handled in
the tcpcheck_main() function, when called from the wake callback function
(wake_srv_chk). But it is indeed a problem, because this function may be
called from the mux layer. This means a mux may call the wake callback
function of the data layer, which may release the connection and the mux. It
is easy to see how it is hazardous. And actually, depending on the
scheduling, it leads to crashes.
Thus, we must avoid releasing the connection in the wake callback
context, and move this part into the check's process function instead. To
do so, we rely on the CHK_ST_CLOSE_CONN flag. When a connection must be
replaced by a new one, this flag is set on the check, in the
tcpcheck_main() function, and the
check's task is woken up. Then, the connection is really closed in
process_chk_conn() function.
This patch must be backported as far as 2.2, with some adaptations however
because the code is not exactly the same.
Now the show_fd helpers at the transport and mux levels return an integer
which indicates whether or not the inspected entry looks suspicious. When
an entry is reported as suspicious, "show fd" will suffix it with an
exclamation mark ('!') in the dump, that is supposed to help detecting
them.
For now, helpers were adjusted to adapt to the new API but none of them
reports any suspicious entry yet.
Just like we did for the muxes, now the transport layers will have the
ability to provide helpers to report more detailed information about their
internal context. When the helper is not known, the pointer continues to
be dumped as-is if it's not NULL. This way a transport with no context nor
dump function will not add a useless "xprt_ctx=(nil)" but the pointer will
be emitted if valid or if a helper is defined.
When 0c439d895 ("BUILD: tools: make resolve_sym_name() return a const")
was written, the pointer argument ought to have been turned to const for
more flexibility. Let's do it now.
It's about the third time I get confused by these functions, half of
which manipulate the reference as a whole while the others manipulate
only an entry. For me "pat_ref_commit" means committing the pattern reference,
not just an element, so let's rename it. A number of other ones should
really be renamed before 2.4 gets released :-/
This function must be used to emit an alert if a proxy does not have at
least one of the requested capabilities. An additional message may be
appended to the alert.
This is from the output of codespell. It's done at once over a bunch
of files and only affects comments, so there is nothing user-visible.
No backport needed.
commit c55a626217 ("MINOR: contrib/prometheus-exporter: Add
missing global and per-server metrics") is renaming two metrics between
v2.2 and v2.3:
server_idle_connections_current
server_idle_connections_limit
It is breaking some tools which make use of those metrics while
supporting several haproxy versions. This build_info will permit such
tools to match the haproxy version and adjust the list of expected
metrics. This was possible using the haproxy
stats socket but not with prometheus export.
This patch follows prometheus best practices to export specific software
information. It is adding a new field `build_info` so we can extend it
to other parameters if needed in the future.
example output:
# HELP haproxy_process_build_info HAProxy build info.
# TYPE haproxy_process_build_info gauge
haproxy_process_build_info{version="2.4-dev5-2e1a3f-5"} 1
Even though it is not a bugfix, this patch will make more sense when
backported up to >= 2.0
Signed-off-by: William Dauchy <wdauchy@gmail.com>
This allows using the address of the server rather than the name of the
server for keeping track of servers in a backend for stickiness.
The peers code was also extended to support feeding the dictionary using
this key instead of the name.
Fixes #814
The accept-encoding normalizer now explicitly manages a subset of
encodings which will all have their own bit in the encoding bitmap
stored in the cache entry. This way two requests with the same primary
key will be served the same cache entry if they both explicitly accept
the stored response's encoding, even if their respective secondary keys
are not the same and do not match the stored response's one.
The actual hash of the accept-encoding will still be used if the
response's encoding is unmanaged.
The encoding matching and the encoding weight parsing are done for every
subpart of the accept-encoding values, and a bitmap of accepted
encodings is built for every request. It is then tested upon any stored
response that has the same primary key until one with an accepted
encoding is found.
The specific "identity" and "*" accept-encoding values are managed too.
When storing a response in the cache, we also parse the content-encoding
header in order to only set the response's corresponding encoding's bit
in its cache_entry encoding bitmap.
This patch fixes GitHub issue #988.
It does not need to be backported.
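As a toy sketch of the bitmap idea (not the actual implementation, and
with a shortened encoding list):

    #include <stddef.h>
    #include <stdint.h>
    #include <strings.h>

    /* Fold each explicitly managed encoding into one bit: two headers
     * listing the same encodings in different orders yield the same
     * bitmap, so ordering no longer splits the cache.
     */
    static uint16_t accept_encoding_bitmap(const char *const *tokens, int n)
    {
        static const char *known[] = { "identity", "gzip", "deflate", "br", "*" };
        uint16_t map = 0;
        int i; size_t j;

        for (i = 0; i < n; i++)
            for (j = 0; j < sizeof(known) / sizeof(*known); j++)
                if (strcasecmp(tokens[i], known[j]) == 0)
                    map |= 1U << j;
        return map;
    }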
If any of the secondary hash normalizing functions raises an error, the
secondary hash will be unusable. In this case, the response will not be
stored anymore.
Add traces to have an idea why this function may fail. In fact
it never fails when the passed parameters are correct, especially the
lengths. This is not the case when a packet is not correctly built
before being encrypted.
Even if the sizes of the frames built by qc_build_frm() are computed so
as not to overflow a buffer, do not rely on this and always make a
packet build fail if we could not build a frame.
Also add traces to have an idea where qc_build_frm() fails.
Fixes a memory leak in qc_build_phdshk_apkt().
Remove ->ifcdata which was there to control the CRYPTO data sent to the
peer so as not to saturate its reception buffer. This was a sort
of flow control.
Add a ->prep_in_flight counter to the QUIC path struct to control the
number of bytes prepared to be sent so as not to saturate the
congestion control window. This counter is increased each time a
packet is built. This has nothing to do with ->in_flight which
is the real number of in-flight bytes which have really been sent.
We are obliged to maintain two such counters to know how many bytes
of data we can prepare before sending them.
Modify accordingly the traces which were useful to diagnose issues about
the congestion control window usage.
As there is a lot of information in this protocol, it is not
easy to make the traces readable. We remove a few of them here and
shorten some lines by shortening the variable names.
This patch adds QUIC structs to the server struct so as to make the QUIC code
compile. Also initializes the ebtree to store the connections by connection
IDs.
This patch adds a quic_transport_params struct to bind_conf struct
used for the listeners. This is to store the QUIC transport parameters
for the listeners. Also initializes them when calling str2listener().
Before str2sa_range() it's too early to figure out that we're going to
speak QUIC,
and after it's too late as listeners are already created. So it seems that
doing it in str2listener() when the protocol is discovered is the best
place.
Also adds two ebtrees to the underlying receivers to store the connections
by connection IDs (one for the original connection IDs, and another
one for the definitive connection IDs which really identify the
connections).
However it doesn't seem normal that this is stored in the receiver or
the listener. There should be a private context in the listener so that
protocols can store internal information. This element should in
fact be the listener handle.
Something still feels wrong, and probably we'll have to make QUIC and
SSL co-exist: a proof of this is that there's some explicit code in
bind_parse_ssl() to prevent the "ssl" keyword from replacing the xprt.
This patch imports all the definitions for the QUIC protocol with a few modifications
from 20200720-quic branch of quic-dev repository found at
https://github.com/haproxytech/quic-dev.
QUIC needs to initialize its BIO and SSL session the same way as for SSL over TCP
connections. It also needs to use the same ClientHello callback.
This patch only exports functions and variables shared between QUIC and SSL/TCP
connections.
We add src/quic_sock.c QUIC specific socket management functions as callbacks
for the control layer: ->accept_conn, ->default_iocb and ->rx_listening.
accept_conn() will have to be defined. The default I/O handler only calls
recvfrom() on the datagrams received. Furthermore, the ->rx_listening
callback always returns 1 at this time but should return 0 when reloading
the process.
This is a simple patch to prepare the integration of QUIC support to come.
quic_conn struct is supposed to embed any QUIC specific information for a QUIC
connection.
As QUIC is a connection oriented protocol, this file is almost a copy of
proto_tcp without TCP specific features. To suspend/resume a QUIC receiver
we proceed the same way as for proto_udp receivers.
With the recent updates to the listeners, we don't need a specific set of
quic*_add_listener() functions, the default ones are sufficient. The fields
declaration were reordered to make the various layers more visible like in
other protocols.
udp_suspend_receiver/udp_resume_receiver are up-to-date (the check for INHERITED
is present) and the code being UDP-specific, it's normal to use UDP here.
Note that in the future we might more easily reference stacked layers so that
there's no more need for specifying the pointer here.
This way we make all xxhash functions inline, with implementations being
directly included within xxhash.h.
Makefile is updated as well, since we don't need to compile and link
xxhash.o anymore.
Inlining should improve performance on small data inputs.
A new XXH3 variant of hash functions shows a noticeable improvement in
performance (especially on small data), and also brings 128-bit support,
better inlining and streaming capabilities.
Performance comparison is available here:
https://github.com/Cyan4973/xxHash/wiki/Performance-comparison
The assembler on MacOS aarch64 interprets ; as the beginning of comments,
so it is not suitable for separating instructions in inline asm. Use \n
instead.
This should be backported to 2.3, 2.2, 2.1, 2.0 and 1.9.
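For example (illustrative macro), the following assembles everywhere,
while a ";" separator would have the second instruction swallowed as a
comment on macOS aarch64:

    #define TWO_NOPS() __asm__ volatile("nop\n\tnop")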
The MAX_DELAY_MS constant, which sets an upper limit on the poll wait
time and forces a wakeup this often, used to be set to 1 second in order
to easily spot and correct time drifts. This was added 12 years ago at an
era where virtual machines were starting to become common in server
environments while not working particularly well. Nowadays, such issues
are not as common anymore, however forcing 64 threads to wake up every
single second starts to make the process visible on otherwise idle
systems. Let's increase this wakeup interval to one minute. In the worst
case it will make idle threads wake up once a minute, which remains low.
If this is not sufficient anymore on some systems, another approach
would consist in implementing a deep-sleep mode which only triggers
after a while and which is always disabled if any time drift is
observed.
This fixes building hpack from contrib, which failed because of the
undeclared VAR_ARRAY:
make -C contrib/hpack
...
cc -O2 -Wall -g -I../../include -fwrapv -fno-strict-aliasing -c -o gen-enc.o gen-enc.c
In file included from gen-enc.c:18:
../../include/haproxy/hpack-tbl-t.h:105:23: error: 'VAR_ARRAY' undeclared here (not in a function)
105 | struct hpack_dte dte[VAR_ARRAY]; /* dynamic table entries */
...
As discussed in the thread below, let's redefine VAR_ARRAY in this file
so that it remains self-sustaining:
https://www.mail-archive.com/haproxy@formilux.org/msg39212.html
SSL_CTX_get0_privatekey is an openssl/boringssl specific function present
since openssl-1.0.2. Let us define a readable guard for it, not depending
on HA_OPENSSL_VERSION.
As Ilya reported in issue #998, gcc 11 complains about misleading code
indentation which is in fact caused by dead assignments to zero after
a loop which stops on zero. Let's clean both of these.
Due to the addition of the OpenTracing filter it is necessary to define
ARGC_OT enum. This value is used in the functions fmt_directive() and
smp_resolve_args().
The OpenTracing filter uses several internal HAProxy functions to work
with variables and therefore requires two static local HAProxy functions,
var_accounting_diff() and var_clear(), to be declared global.
In fact, the var_clear() function was not originally defined as static,
but it lacked a declaration.
SSL_CTX_add_server_custom_ext is an openssl specific function present
since openssl-1.0.2. Let us define a readable guard for it, not depending
on HA_OPENSSL_VERSION.
Right now the connection subscribe/unsubscribe code needs to manipulate
FDs, which is not compatible with QUIC. In practice what we need there
is to be able to either subscribe or wake up depending on readiness at
the moment of subscription.
This commit introduces two new functions at the control layer, which are
provided by the socket code, to check for FD readiness or subscribe to it
at the control layer. For now it's not used.
Now we don't touch the fd anymore there, instead we rely on the ->drain()
provided by the control layer. As such the function was renamed to
conn_ctrl_drain().
This is what we need to drain pending incoming data from a connection.
The code was taken from conn_sock_drain() without the connection-specific
stuff. It still takes a connection for now for API simplicity.
conn_fd_handler() is 100% specific to socket code. It's about time
it moves to sock.c which manipulates socket FDs. With it comes
conn_fd_check() which tests for the socket's readiness. The ugly
connection status check at the end of the iocb was moved to an inlined
function in connection.h so that if we need it for other socket layers
it's not too hard to reuse.
The code was really only moved and not changed at all.
The send() loop present in this function and the error handling is already
present in raw_sock_from_buf(). Let's rely on it instead and stop touching
the FD from this place. The send flag was changed to use a more agnostic
CO_SFL_*. The name was changed to "conn_ctrl_send()" to remind that it's
meant to be used to send at the lowest level.
These are two other areas where this fd_stop_recv()/fd_stop_send() makes no
sense anymore. Both happen by definition while the FD is *not* subscribed,
since nowadays it's subscribed after failing recv()/send(), in which case
we cannot close.
These functions used to disable polling for writes when shutting down
but this is no longer used as it still happens later when closing if the
connection was subscribed to FD events. Let's just remove this fake and
undesired dependency on the FD layer.
Add a new http-request action 'set-timeout [server/tunnel]'. This action
can be used to update the server or tunnel timeout of a stream. It takes
two parameters, the timeout name to update and the new timeout value.
This rule is only valid for a proxy with backend capabilities. The
timeout value cannot be null. A sample expression can also be used
instead of a plain value.
Allow the modification of the tunnel timeout on the stream side.
Use a new field in the stream for the tunnel timeout. It is initialized
by the tunnel timeout from backend unless it has already been set by a
set-timeout tunnel rule.
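An illustrative configuration snippet (section and values made up):

    backend be_app
        timeout server 5s
        timeout tunnel 2m
        # raise the server timeout for a known-slow path
        http-request set-timeout server 30s if { path_beg /reports }
        # websockets get a longer tunnel timeout
        http-request set-timeout tunnel 1h if { req.hdr(upgrade) -i websocket }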
These functions are not used anymore and were quite confusing given that
their names reflected their original role and not the current ones. Let's
kill them before they inspire anyone.
We had cs_close() which forces a CS_SHR_RESET mode on the read side,
and due to this there are a few call places in the checks which
perform a manual call to conn_sock_drain() before calling cs_close().
This is absurd by principle, and it can be counter-productive in the
case of a mux where this could even cause the opposite of the desired
effect by deleting pending frames on the socket before closing.
Let's add cs_drain_and_close() which uses the CS_SHR_DRAIN mode to
prepare this.
QUIC will rely on UDP at the receiver level, and will need these functions
to suspend/resume the receivers. In the future, protocol chaining may
simplify this.
In conn_ctrl_init() and conn_ctrl_close() we now use the control layer's
functions instead of manipulating the FD directly. This is safe since the
control layer is always present when done. Note that now we also adjust
the flag before calling the function to make things cleaner in case such
a layer would need to call the same functions again for any reason.
Currently conn_ctrl_init() does an fd_insert() and conn_ctrl_close() does an
fd_delete(). These are the only two short-term obstacles against using a
non-fd handle to set up a connection. Let's move these into the protocol
layer, along with the other connection-level stuff so that the generic
connection code uses them instead. This will make it possible to define
new ones for other protocols (e.g. QUIC).
Since we only support regular sockets at the moment, the code was placed
into sock.c and shared with proto_tcp, proto_uxst and proto_sockpair.
For the sake of an improved readability, let's group the protocol
field members according to where they're supposed to be defined:
- connection layer (note: for now even UDP needs one)
- binding layer
- address family
- socket layer
Nothing else was changed.
The various protocols were made static since there was no point in
exporting them in the past. Nowadays with QUIC relying on UDP we'll
significantly benefit from UDP being exported and more generally from
being able to declare some functions as being the same as other
protocols'.
In an ideal world it should not be these protocols which should be
exported, but the intermediary levels:
- socket layer (sock.c only right now), already exported as functions
but nothing structured at the moment ;
- family layer (sock_inet, sock_unix, sockpair etc): already structured
and exported
- binding layer (the part that relies on the receiver): currently fused
within the protocol
- connection layer (the part that manipulates connections): currently
fused within the protocol
- protocol (connection's control): shouldn't need to be exposed
ultimately once the elements above are in an easily sharable way.
This field used to be needed before commit 2b5e0d8b6 ("MEDIUM: proto_udp:
replace last AF_CUST_UDP* with AF_INET*") as it was used as a protocol
entry selector. Since this commit it's always equal to the socket family's
value so it's entirely redundant. Let's remove it now to simplify the
protocol definition a little bit.
The input buffer passed as argument to create a new stream must not be
transferred when the request channel is initialized because the channel
flags are not set at this stage. In addition, the API is a bit confusing
regarding the buffer owner when an error occurs. The caller remains the
owner, but reading the code it is not obvious.
So, first of all, to avoid any ambiguity, comments are added on the
calling chain to make it clear. The buffer owner is the caller if any error
occurs. And the ownership is transferred to the stream on success.
Then, to make things simple, the ownership is transferred at the end of
stream_new(), in case of success. And the input buffer is updated to point
on BUF_NULL. Thus, in all cases, if the caller tries to release it by
calling b_free() on it, it is not a problem. Of course, it remains the
caller's responsibility to release it on error.
The patch fixes a bug introduced by the commit 26256f86e ("MINOR: stream:
Pass an optional input buffer when a stream is created"). No backport is
needed.
With the removal of the family-specific port setting, all protocols had
exactly the same implementation of ->add(). A generic one was created
with the name "default_add_listener" so that all other ones can now be
removed. The API was slightly adjusted so that the protocol and the
listener are passed instead of the listener and the port.
Note that all protocols continue to provide this ->add() method instead
of routinely calling default_add_listener() from create_listeners(). This
makes sure that any non-standard protocol will still be able to intercept
the listener addition if needed.
This could be backported to 2.3 along with the few previous patches on
listeners as a pure code cleanup.
At various places we need to set a port on an IPv4 or IPv6 address, and
it requires casts that are easy to get wrong. Let's add a new set_port()
helper to the address family to assist in this. It will be directly
accessible from the protocol and will make the operation seamless.
Right now this is only implemented for sock_inet as other families do
not need a port.
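A self-contained sketch of such a helper (the real one hangs off the
address family descriptor):

    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    /* Set <port> (host byte order) on an IPv4/IPv6 address, hiding the
     * casts that are easy to get wrong at call places.
     */
    static void set_port(struct sockaddr_storage *addr, int port)
    {
        if (addr->ss_family == AF_INET)
            ((struct sockaddr_in *)addr)->sin_port = htons(port);
        else if (addr->ss_family == AF_INET6)
            ((struct sockaddr_in6 *)addr)->sin6_port = htons(port);
    }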
This flag is now unused. It was used in the REQ_WAIT_HTTP analyser, when a
stream was waiting for a request, to set the keep-alive timeout or to
avoid sending HTTP errors to the client.
The ctl param MUX_EXIT_STATUS can be requested to get the exit status of a
multiplexer. For instance, it may be an HTTP status code or an H2 error. For
now, 0 is always returned. When the mux h1 will be able to return HTTP
errors itself, this ctl param will be used to get the HTTP status code from
the logs.
The mux_exit_status enum has been created to map internal mux exit
statuses to generic ones. Thus there are 5 possible statuses for now:
success, invalid error, timeout error, internal error and unknown.
The cumulative numbers of http requests and http errors of the counters
tracked at the session level, and their rates, can now be updated at the
session level thanks to two new functions. These functions are not used
for now, but they will be called to keep tracked counters up-to-date if
an error occurs before the stream creation.
The cumulative numbers of http requests, http errors, bytes received and
sent and their respective rates for tracked counters are now updated using
specific stream-independent functions. These functions are used by the
stream but the aim is to allow the session to do so too. For now, there is
no reason to perform these updates from the session, except from the mux-h2
maybe. But, the mux-h1, on the frontend side, will be able to return some
errors to the client, before the stream creation. In this case, it will be
mandatory to update counters tracked at the session level.
It is now possible to set the buffer used by the request channel when a
stream is created. It may be useful if input data are already received,
instead of waiting for the first call to the mux rcv_buf() callback. This
change is mandatory to support H1 connections with no stream attached.
For now, the multiplexers don't pass any buffer. BUF_NULL is thus used to
call stream_create_from_cs().
This callback function was only defined by the mux-h1. But it has been
removed in the previous commit because it is unused now. So, we can do a
step forward removing the callback function from the mux definition and the
cs_info structure.
The idle duration between two streams is added to the session structure. It
is not necessarily pertinent on all protocols. In fact, it is only defined
for H1 connections. It is the duration between two H1 transactions. But the
.get_cs_info() callback function on the multiplexers only exists because
this duration is missing at the session level. So it is a simplification
opportunity for a really low cost.
To reduce the cost, a hole in the session structure is filled by moving
.srv_list field at the end of the structure.
The goal is to allow execution of one main lua state per thread.
The array introduces storage of one reference per thread, because each
lua state can have a different reference id for the same function. A
function returns the preferred state id according to the configuration
and the current thread id.
The goal is to allow execution of one main lua state per thread.
"state_from" is a pointer to the parent lua state. "state_id"
is the index of the parent state id in the reference lua states
array. "state_id" is better because the lock is a "== 0" test
which is quick than pointer comparison. In other way, the state_id
index could index other things the the Lua state concerned. I
think to the function references.
The goal is to allow execution of one main lua state per thread.
This function will initialize the struct with other things than 0.
With this function helper, the initialization is centralized and
it prevents mistakes. This patch also keeps a reference to each
declared function in a list. It will be useful in next patches to
control consistency of declared references.
The goal is to allow execution of one main lua state per thread.
The function hlua_ctx_init() now gets the original lua state from
its caller. This allows the initialisation of lua_thread (coroutines)
from any master lua state.
The parent lua state is stored in the hlua struct.
This patch is a temporary transition, it will be modified later.
The goal is to allow execution of one main lua state per thread.
Because this struct will be filled after the configuration parser, we
cannot copy the content. The current state of the HAProxy code doesn't
justify this change, it is an update preparing the next steps.
This function will be useful to check if the keyword is already registered.
Also add a define for the max number of args.
This will be needed by a next patch to fix a bug and will have to be
backported.
This function simply calls action_lookup() on the private service_keywords,
to look up a service name. This will be used to detect double registration
of the same service from Lua.
This will be needed by a next patch to fix a bug and will have to be
backported.
These functions will be useful to check if a keyword is already registered.
This will be needed by a next patch to fix a bug, and will need to be
backported.
Sometimes it would be nice to be able to only trace abnormal events such
as protocol errors. Let's add a new "error" level below the "user" level
for this. This will make it possible to add TRACE_ERROR() at various
error points and only see them.
This patch adds a new logging variable '%HPO' for logging HTTP path only
(without query string) from relative or absolute URI.
For example:
log-format "hpo=%HPO hp=%HP hu=%HU hq=%HQ"
GET /r/1 HTTP/1.1
=>
hpo=/r/1 hp=/r/1 hu=/r/1 hq=
GET /r/2?q=2 HTTP/1.1
=>
hpo=/r/2 hp=/r/2 hu=/r/2?q=2 hq=?q=2
GET http://host/r/3 HTTP/1.1
=>
hpo=/r/3 hp=http://host/r/3 hu=http://host/r/3 hq=
GET http://host/r/4?q=4 HTTP/1.1
=>
hpo=/r/4 hp=http://host/r/4 hu=http://host/r/4?q=4 hq=?q=4
Since 2.3, the default local log format always adds the hostname field.
This behavior change was due to the log/sink re-work, because according
to rfc3164 the hostname field is mandatory.
This patch re-introduces a legacy "local" format which is analogous
to rfc3164 but with the hostname stripped. This is the new
default if logs are generated by haproxy.
To stay compliant with previous configurations, the option
"log-send-hostname" acts as if the default format is switched
to rfc3164.
This patch addresses the github issue #963
This patch should be backported in branches >= 2.3.
In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM
nodes which would not happen on x86 nodes. After investigation it turned
out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much
more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding
memory ordering.
The issue that was triggered there is that if a tasklet_wakeup() call
is made on a tasklet scheduled to run on a foreign thread and that
tasklet is just being dequeued to be processed, there can be a race at
two places:
- if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and
LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list,
because the emptiness test matches ;
- if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in
run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends
up being implemented, it may even corrupt the adjacent nodes while
they're being reused for the in-tree storage.
This issue was introduced in 2.2 when support for waking up remote
tasklets was added. Initially the attachment of a tasklet to a list
was enough to know its status and this used to be stable information.
Now it's not sufficient to rely on this anymore, thus we need to use
a different information.
This patch solves this by adding a new task flag, TASK_IN_LIST, which
is atomically set before attaching a tasklet to a list, and is only
removed after the tasklet is detached from a list. It is checked
by tasklet_wakeup_on() so that it may only be done while the tasklet
is out of any list, and is cleared during the state switch when calling
the tasklet. Note that the flag is not set for pure tasks as it's not
needed.
However this introduces a new special case: the function
tasklet_remove_from_tasklet_list() needs to keep both states in sync
and cannot check both the state and the attachment to a list at the
same time. This function is already limited to being used by the thread
owning the tasklet, so in this case the test remains reliable. However,
just like its predecessors, this function is wrong by design and it
should probably be replaced with a stricter one, a lazy one, or be
totally removed (it's only used in checks to avoid calling a possibly
scheduled event, and when freeing a tasklet). Regardless, for now the
function exists so the flag is removed only if the deletion could be
done, which covers all cases we're interested in regarding the insertion.
This removal is safe against a concurrent tasklet_wakeup_on() since
MT_LIST_DEL() guarantees the atomic test, and will ultimately clear
the flag only if the task could be deleted, so the flag will always
reflect the last state.
This should be carefully backported as far as 2.2 after some
observation period. This patch depends on previous patch
"MINOR: task: remove __tasklet_remove_from_tasklet_list()".
This function is only used at a single place directly within the
scheduler in run_tasks_from_lists() and it really ought not be called
by anything else, regardless of what its comment says. Let's delete
it, move the two lines directly into the call place, and take this
opportunity to factor the atomic decrement on tasks_run_queue. A comment
was added on the remaining tasklet_remove_from_tasklet_list() to
mention the risks in using it.
This function is only called at a single place and adds more confusion
than it removes. It also makes one think it could be used outside of
the scheduler while it must absolutely not. Let's just move its two
lines to the call place, making the code more readable there. In
addition this clearly shows that the preliminary LIST_INIT() is
useless since the entry is immediately overwritten.
In MT_LIST_TRY_ADDQ() and MT_LIST_TRY_ADD() we can't just check if the
element is already in a list, because there's a small race condition, it
could be added between the time we checked, and the time we actually set
its next and prev, so we have to lock it first.
This is required to address issue #958.
This should be backported to 2.3, 2.2 and 2.1.
As suggested by @AGSaidi in issue #958, on ARMv8 it's convenient to use
an "isb" instruction in pl_cpu_relax() to improve fairness. Without it
I've met a few watchdog conditions on valid locks with 16 threads,
indicating that some threads couldn't manage to get one in 2 seconds. It
never happened again with it. In addition, the performance increased
by slightly more than 5% thanks to the reduced contention.
This should be backported as far as 2.2, possibly even 2.0.
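A sketch of the primitive with the ARMv8 barrier (x86 shown for
comparison; the exact plock code may differ):

    #if defined(__aarch64__)
    /* "isb" flushes the pipeline and opens a window for other cores to
     * grab the lock, improving fairness over a plain busy loop.
     */
    #define pl_cpu_relax() __asm__ volatile("isb" ::: "memory")
    #elif defined(__x86_64__) || defined(__i386__)
    /* the classical "pause" hint */
    #define pl_cpu_relax() __asm__ volatile("rep;nop" ::: "memory")
    #else
    #define pl_cpu_relax() do { } while (0)
    #endif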
The parsing of the check options based on tcp-check rules (redis, spop,
smtp, http...) is moved away from check.c. Now, these functions are placed
in tcpcheck.c. These functions are only related to the tcpcheck ruleset
configured on a proxy and not to the health-check attached to a server.
This option is now ignored because I/O check buffers are now allocated using the
buffer pool. Thus, it is marked as deprecated in the documentation and ignored
during the configuration parsing. The field is also removed from the global
structure.
Because this option is ignored since a recent fix, backported as far as 2.2,
this patch should be backported too. Especially because it updates the
documentation.
Historically, the input and output buffers of a check are allocated by hand
during the startup, with a specific size (not necessarily the same than
other buffers). But since the recent refactoring of the checks to rely
exclusively on the tcp-checks and to use the underlying mux layer, this part
is totally buggy. Indeed, because these buffers are now passed to a mux,
they may be swapped if a zero-copy is possible. In fact, for now it is
only possible in h2_rcv_buf(). Thus the bug concretely only exists if an h2
health-check is performed. But, it is a latent bug for other muxes.
Another problem is the size of these buffers. Because it may differ from
the other buffers' size, it might be a source of bugs.
Finally, for configurations with hundreds of thousands of servers, having 2
buffers per check always allocated may be an issue.
To fix the bug, we now allocate these buffers when required using the buffer
pool. Thus not-running checks don't waste memory and muxes may swap them if
possible. The only drawback is the check buffers have now always the same
size than buffers used by the streams. This deprecates indirectly the
"tune.chksize" global option.
In addition, the http-check regtest has been updated to perform some h2
health-checks.
Many thanks to @VigneshSP94 for their help on this bug.
This patch should solve the issue #936. It relies on the commit "MINOR:
tcpcheck: Don't handle anymore in-progress send rules in tcpcheck_main".
Both must be backported as far as 2.2.
The Vary functionality is based on a secondary key that needs to be
calculated for every request to which a server answers with a Vary
header. The Vary header, which can only be found in server responses,
determines which headers of the request need to be taken into account in
the secondary key. Since we do not want to have to store all the headers
of the request until we have the response, we will pre-calculate as many
sub-hashes as there are headers that we want to manage in a Vary
context. We will only focus on a subset of headers which are likely to
be mentioned in a Vary response (accept-encoding and referer for now).
Every managed header will have its own normalization function which is
in charge of transforming the header value into a core representation,
more robust to insignificant changes that could exist between multiple
clients. For instance, two accept-encoding values mentioning the same
encodings but in different orders should give the same hash.
This patch adds a function that parses a Vary header value and checks if
all the values belong to our supported subset. It also adds the
normalization functions for our two headers, as well as utility
functions that can prebuild a secondary key for a given request and
transform it into an actual secondary key after the vary signature is
determined from the response.
When at least one data filter is registered on a channel, the offsets of all
filters must be kept up to date. For data filters but also for others. It is
safer to do it in that way. Indirectly, this patch fixes 2 hidden bugs
revealed by the commit 22fca1f2c ("BUG/MEDIUM: filters: Forward all filtered
data at the end of http filtering").
The first one, the worst of both, happens at the end of http filtering when
at least one data filter is registered on the channel. We call the
http_end() callback function on the filters, when defined, to finish the
http filtering. But it is performed for all filters. Before the commit
22fca1f2c, the only risk was to call the http_end() callback function
unexpectedly on a filter. Now, we may have an overflow on the offset
variable, used at the end to forward all filtered data. Of course, from the
moment we forward an arbitrary huge amount of data, all kinds of bad things
may happen. So offset computation is performed for all filters and
http_end() callback function is called only for data filters.
The other one happens when a data filter alters the data of a channel: it
must update the offsets of all previous filters. But the offsets of
non-data filters must be up to date too, otherwise, here again, we may
have an integer overflow.
Another way to fix these bugs is to always ignore non-data filters from the
offsets computation. But this patch is safer and probably easier to
maintain.
This patch must be backported in all versions where the above commit is. So
as far as 2.0.
Level-7 retries are only possible with a restricted number of HTTP
return codes. While it is usually not safe to retry on 401 and 403, I
came up with an authentication backend which was not synchronizing
authentication of users. While not perfect, being allowed to also retry
on those return codes is really helpful and acts as a hotfix until we
can fix the backend.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
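An illustrative configuration making use of the relaxed list (addresses
made up):

    backend be_auth
        retries 3
        # also retry on auth codes while the backend is being fixed
        retry-on conn-failure 401 403
        server auth1 192.0.2.10:8080
        server auth2 192.0.2.11:8080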
This patch adds -m flag which allows to specify header name
matching method when deleting headers from http request/response.
Currently beg, end, sub, str and reg are supported.
This is related to GitHub issue #909
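Example usage (header names made up):

    # remove all request headers whose name starts with "x-private-"
    http-request del-header x-private- -m beg
    # remove response headers matching a regex
    http-response del-header ^x-(debug|trace) -m reg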
Baptiste reported a new crash affecting 2.3 which can be triggered
when using H2 on the backend, with http-reuse always and with tens
of clients doing close only. There are a few combined cases which cause
this to happen, but each time the issue is the same, an already freed
session is dereferenced in session_unown_conn().
Two cases were identified to cause this:
- a connection referencing a session as its owner, which is detached
from the session's list and is destroyed after this session ends.
The test on conn->owner before calling session_unown_conn() is not
sufficient as the pointer is not null but is not valid anymore.
- a connection that never goes idle and that gets killed from the
mux, where session_free() is called first, then conn_free() calls
session_unown_conn() which scans the just freed session for older
connections. This one is only triggered with DEBUG_UAF.
The reason for this session to be present here is that it's needed during
the connection setup, to be passed to conn_install_mux_be() to mux->init()
as the owning session, but it's never deleted afterwards. Furthermore, even
conn_session_free() doesn't delete this pointer after freeing the session
that lies there. Both do definitely result in a use-after-free that's more
easily triggered under DEBUG_UAF.
This patch makes sure that the owner is always deleted after detaching
or killing the session. However it is currently not possible to clear
the owner right after a synchronous init because the proxy protocol
apparently needs it (a reg test checks this), and if we leave it past
the connection setup with the session not attached anywhere, it's hard
to catch the right moment to detach it. This means that the session may
remain in conn->owner as long as the connection has never been added to
nor removed from the session's idle list. Given that this patch needs to
remain simple enough to be backported, instead it adds a workaround in
session_unown_conn() to detect that the element is already not attached
anywhere.
This fix absolutely requires previous patch "CLEANUP: connection: do not
use conn->owner when the session is known" otherwise the situation will
be even worse, as some places used to rely on conn->owner instead of the
session.
The fix could theoretically be backported as far as 1.8. However, the code
in this area has significantly changed along versions and there are more
risks of breaking working stuff than fixing real issues there. The issue
was really woken up in two steps during 2.3-dev when slightly reworking
the idle conns with commit 08016ab82 ("MEDIUM: connection: Add private
connections synchronously in session server list") and when adding
support for storing used H2 connections in the session and adding the
necessary call to session_unown_conn() in the muxes. But the same test
managed to crash 2.2 when built in DEBUG_UAF and patched like this,
proving that we used to already leave dangling pointers behind us:
| diff --git a/include/haproxy/connection.h b/include/haproxy/connection.h
| index f8f235c1a..dd30b5f80 100644
| --- a/include/haproxy/connection.h
| +++ b/include/haproxy/connection.h
| @@ -458,6 +458,10 @@ static inline void conn_free(struct connection *conn)
| sess->idle_conns--;
| session_unown_conn(sess, conn);
| }
| + else {
| + struct session *sess = conn->owner;
| + BUG_ON(sess && sess->origin != &conn->obj_type);
| + }
|
| sockaddr_free(&conn->src);
| sockaddr_free(&conn->dst);
It's uncertain whether an existing code path there can lead to dereferencing
conn->owner when it's bad, though certain suspicious memory corruption bugs
make one think it's a likely candidate. The patch should not be hard to
adapt there.
Backports to 2.1 and older are left to the appreciation of the person
doing the backport.
A reproducer consists in this:
global
nbthread 1
listen l
bind :9000
mode http
http-reuse always
server s 127.0.0.1:8999 proto h2
frontend f
bind :8999 proto h2
mode http
http-request return status 200
Then this will make it crash within 2-3 seconds:
$ h1load -e -r 1 -c 10 http://0:9000/
If it does not, it might be that DEBUG_UAF was not used (it's harder to
trigger then) and it might be useful to restart.
At a few places we used to rely on conn->owner to retrieve the session
while the session is already known. This is not correct because at some
of these points the reason the connection's owner was still the session
(instead of NULL) is a mistake. At one place a comparison is even made
between the session and conn->owner assuming it's valid without checking
if it's NULL. Let's clean this up to use the session all the time.
Note that this will be needed for a forthcoming fix and will have to be
backported.
HAVE_SSL_CTX_SET_CIPHERSUITES is a newly defined macro set in
openssl-compat.h, which helps to identify ssl libs (currently
OpenSSL-1.1.1 only) that support TLS13 ciphersuites manipulation on a
TLS13 context.
in the context of a progressive backend migration, we want to be able to
activate SSL on outgoing connections to the server at runtime without
reloading.
This patch adds a `set server ssl` command; in order to allow that:
- add `srv_use_ssl` to `show servers state` command for compatibility,
also update associated parsing
- when using default-server ssl setting, and `no-ssl` on server line,
init SSL ctx without activating it
- when triggering ssl API, de/activate SSL connections as requested
- clean ongoing connections as it is done for addr/port changes, without
checking prior server state
example config:
backend be_foo
default-server ssl
server srv0 127.0.0.1:6011 weight 1 no-ssl
show servers state:
5 be_foo 1 srv0 127.0.0.1 2 0 1 1 15 1 0 4 0 0 0 0 - 6011 - -1
where srv0 can switch to ssl later during the runtime:
set server be_foo/srv0 ssl on
5 be_foo 1 srv0 127.0.0.1 2 0 1 1 15 1 0 4 0 0 0 0 - 6011 - 1
Also update existing tests and create a new one.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
Functions registered to release memory per-thread have no return value. But the
registering function and the function pointer in per_thread_free_fct structure
specify it should return an integer. This patch fixes it.
This patch may be backported as far as 2.0.
Define a per-thread counters area, allocated with the greatest size of
any stat module's counters. This variable is named trash_counters.
When using a proxy without allocated counters, return the trash counters
from EXTRA_COUNTERS_GET instead of a dangling pointer to prevent
segfault.
This is useful for all the proxies used internally which do not belong to
the global proxy list. As these objects do not appear on the stats
report, it does not matter if they use the dummy counters.
For this fix to be functional, the extra counters are explicitly
initialized to NULL on proxy/server/listener init functions.
Most notably, the crash has already been detected with the following
vtc:
- reg-tests/lua/txn_get_priv.vtc
- reg-tests/peers/tls_basic_sync.vtc
- reg-tests/peers/tls_basic_sync_wo_stkt_backend.vtc
There are probably other parts that may be impacted (SPOE for example).
This bug was introduced in the current release and does not need to be
backported. The faulty commits are
"MINOR: ssl: count client hello for stats" and
"MINOR: ssl: add counters for ssl sessions".
Do not cache responses that do not have an explicit expiration time
(s-maxage or max-age Cache-Control directives or Expires header) or a
validator (ETag or Last-Modified headers) anymore, as suggested in
RFC 7234#3.
The TX_FLAG_IGNORE flag is used instead of the TX_FLAG_CACHEABLE so as
not to change the behavior of the checkcache option.
No details are provided when an error occurs during the parsing of an
errorfile, thus it is a bit hard to diagnose where the problem is. Now,
when it happens, an
understandable error message is reported.
This patch is not a bug fix in itself. But it will be required to change
a fatal error into a warning in the latest stable releases. Thus it must
be backported as far as 2.0.
It does not require heavy deletion from the expr anymore, so we can now
turn this to a single-linked list since most of the time we want to delete
all instances of a given pattern from the head. By doing so we save 32 bytes
of memory per pattern. The pat_unlink_from_head() function was adjusted
accordingly.
This function will be usable to purge at most a specified number of old
entries from a reference. Entries are declared old if their generation
number is in the past compared to the one passed in argument. This will
ease removal of early entries when new ones have been appended.
We also call malloc_trim() when available, at the end of the series,
because this is one place where there is a lot of memory to save. Reloads
of 1M IP addresses used in an ACL made the process grow up to 1.7 GB RSS
after 10 reloads and roughly stabilize there without this call, versus
only 260 MB when the call is present. Sadly there is no direct equivalent
for jemalloc, which stabilizes around 800MB-1GB.
pat_ref_load() basically combines pat_ref_append() and pat_ref_commit().
It's very similar to pat_ref_add() except that it also allows to set the
generation ID and the line number. pat_ref_add() was modified to directly
rely on it to avoid code duplication. Note that a previous declaration
of pat_ref_load() was removed as it was just a leftover of an earlier
incarnation of something possibly similar, so no existing functionality
was changed here.
This function will be used after a successful pat_ref_append() to propagate
the pattern to all use places (including parsing and indexing). On failure,
it will entirely roll back all insertions and free the pattern itself. It
also preserves the generation number so that it is convenient for use in
association with pat_ref_append(). pat_ref_add() was modified to rely on
it instead of open-coding the insertion and roll-back.
Right now it's not possible to perform a safe reload because we don't
know what patterns were recently added or were already present. This
patch adds a generation counter to the reference patterns so that it
is possible to know what generation of the reference they were loaded
with. A reference now has two generations, the current one, used for
all additions, and the next one, allocated to those wishing to update
the contents. The generation wraps at 2^32 so comparisons must be made
relative to the current position.
The idea will be that upon full reload, the caller will first get a new
generation ID, will insert all new patterns using it, will then switch
the current ID to the new one, and will delete all entries older than
the current ID. This has the benefit of supporting chunked updates that
remain consistent and that won't block the whole process for ages like
pat_ref_reload() currently does.
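In pseudo-C, the intended reload sequence may look like this (helper
names are partly assumed from the surrounding commits, the glue is not
real code):

    uint next = pat_ref_newgen(ref);           /* reserve the next generation  */
    while (read_next_entry(file, &key, &val))  /* chunked, may yield regularly */
        pat_ref_load(ref, next, key, val, -1, &err);
    ref->curr_gen = next;                      /* switch: new entries are live */
    while (pat_ref_purge_older(ref, next, 100))
        ;                                      /* drop old entries in batches  */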
Till now the only way to remove a known reference was via
pat_ref_delete_by_id() which scans the whole list to find a matching pointer.
Let's add pat_ref_delete_by_ptr() which takes a valid pointer. It can be
called by the function above after the pointer is found, and can also be
used to roll back a failed insertion much more efficiently.
These ones are not used anymore, so let's remove them to remove a bit
of the complexity. The ACL keyword's delete() function could be removed
as well, though most keyword declarations are positional and we have a
high risk of introducing a mistake here, so let's not touch the ACL part.
pat_del_tree_gen() was already chained onto pat_del_list_gen() to deal
with remaining cases, so let's complete the merge and have a generic
pattern deletion function acting on the reference and taking care of
reliably removing all elements.
This is the next step in speeding up entry removal. Now we don't scan
the whole lists or trees for elements pointing to the target reference,
instead we start from the reference and delete all linked patterns.
This simplifies some delete functions since we don't need anymore to
delete multiple times from an expression since all nodes appear after
the reference element. We can now have one generic list and one generic
tree deletion function.
This required the replacement of pattern_delete() with an open-coded
version since we now need to lock all expressions first before proceeding.
This means there is a high risk of lock inversion here but given that the
expressions are always scanned in the same order from the same head, this
must not happen.
Now deleting first entries is instantaneous, and it's still slow to
delete the last ones when looking up their ID since it still requires
to look them up by a full scan, but it's already way faster than
previously. Typically removing the last 10 IPs from a 20M-entry ACL
with a full scan for each took less than 2 seconds.
It would be technically possible to make use of indexed entries to
speed up most lookups for removal by value (e.g. IP addresses) but
that's for later.
There is a data model issue in the current pattern design that makes
pattern deletion extremely expensive: there's no direct way from a
reference to access all indexed occurrences. As such, the only way
to remove all indexed entries corresponding to a reference update
is to scan all expressions's lists and trees to find a link to the
reference. While this was possibly OK when map removal was not
common and most maps were small, this is not conceivable anymore
with GeoIP maps containing 10M+ entries and del-map operations that
are triggered from http-request rulesets.
This patch introduces two list heads from the pattern reference, one
for the objects linked by lists and one for those linked by tree node.
Ideally a single list would be enough but the linked elements are too
unrelated to be distinguished at the moment, so we'll need two
lists. However for the long term a single-linked list will suffice but
for now it's not possible due to the way elements are removed from
expressions. As such this patch adds 32 bytes of memory usage per
reference plus 16 per indexed entry, but both will be cut in half
later.
The links are not yet used for deletion, this patch only ensures the
list is always consistent.
Now we have a single prune() function to act on an expression, and one
delete function for the lists and one for the trees. The presence of a
pointer in the lists is enough to warrant a free, and we rely on the
PAT_SF_REGFREE flag to decide whether to free using free() or regfree().
Currently we have no way to know how to delete/prune a pattern in a
generic way. A pattern doesn't contain its own type so we don't know
what function to call. Tree nodes are roughly OK but not lists where
regex are possible. Let's add one new bit for sflags at index time to
indicate that regex_free() will be needed upon deletion. It's not used
for now.
It's not possible to uniquely update a single expression without updating
the pattern reference, I don't know why we've put the revision in the
expression back then, given that it in fact provides an update for a
full pattern. Let's move the revision into the reference's head instead.
This is in order to access malloc_trim() which is convenient after
clearing huge maps to reclaim memory. When this is detected, we also
define HA_HAVE_MALLOC_TRIM.
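A minimal sketch of the detection and use (glibc only; the guard name
comes from the commit above):

    #if defined(__GLIBC__)
    #include <malloc.h>
    #define HA_HAVE_MALLOC_TRIM
    #endif

    /* after releasing a huge map, hand free heap pages back to the OS */
    static inline void trim_memory(void)
    {
    #ifdef HA_HAVE_MALLOC_TRIM
        malloc_trim(0);
    #endif
    }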
This patch implements a couple of converters to validate and extract data from a
MQTT (Message Queuing Telemetry Transport) message. The validation consists of a
few checks as well as "packet size" validation. The extraction can get any field
from the variable header and the payload.
This is limited to CONNECT and CONNACK packet types only. All other messages are
considered as invalid. It is not a problem for now because only the first packet
on each side can be parsed (CONNECT for the client and CONNACK for the server).
MQTT 3.1.1 and 5.0 are supported.
Reviewed and Fixed by Christopher Faulet <cfaulet@haproxy.com>
This patch implements a couple of converters to validate and extract tag value
from a FIX (Financial Information eXchange) message. The validation
consists of a few checks such as mandatory fields and checksum
computation. The extraction
can get any tag value based on a tag string or tag id.
This patch requires the istend() function. Thus it depends on "MINOR: ist: Add
istend() function to return a pointer to the end of the string".
Reviewed and Fixed by Christopher Faulet <cfaulet@haproxy.com>
Released version 2.4-dev0 with the following main changes :
- MINOR: version: it's development again.
- DOC: mention in INSTALL that it's development again
Let us use SSL_CTX_set1_curves_list, defined by OpenSSL, as well as in
openssl-compat when SSL_CTRL_SET_CURVES_LIST is present (BoringSSL),
for feature detection instead of versions.
OpenSSL 1.0.2 and onwards define SSL_CTX_set1_curves_list which is both a
function and a macro. OpenSSL 1.0.2 to 1.1.0 define SSL_CTRL_SET_CURVES_LIST
as a macro, which disappeared from 1.1.1. BoringSSL only has that one and
not the former macro but it does have the function. Let's keep the test on
the macro matching the function name by defining the macro to itself when
needed.
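The resulting compat glue can be sketched like this:

  /* openssl-compat: BoringSSL has the ctrl macro and the function but
   * no macro named after the function, so define the latter to itself
   * to keep a single feature test everywhere */
  #if defined(SSL_CTRL_SET_CURVES_LIST) && !defined(SSL_CTX_set1_curves_list)
  #define SSL_CTX_set1_curves_list SSL_CTX_set1_curves_list
  #endif

  /* then, at the use places: */
  #if defined(SSL_CTX_set1_curves_list)
          SSL_CTX_set1_curves_list(ctx, curves_list);
  #endif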
This does the same as fd_want_recv() except that it does check for
fd_updt[] to be allocated, as this may be called during early listener
initialization. Previously we used to check fd_updt[] before calling
fd_want_recv() but this is not correct since it does not update the
FD flags. This method will be safer.
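A minimal sketch of such a function (its name and the flag helper below
are assumptions, not the actual code):

  static inline void fd_want_recv_safe(int fd)
  {
          /* always update the FD's flags first... */
          fd_set_recv_desired(fd);            /* hypothetical helper */
          /* ...but only queue a poller update once fd_updt[] exists */
          if (fd_updt)
                  fd_updt[fd_nbupdt++] = fd;
  }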
The mem stats are pretty convenient to spot leaks, except that they count
free(NULL) as 1, and the code does actually have quite a number of free(foo)
guards where foo is NULL if the object was already freed. Let's just not
count these ones so that the stats remain consistent. Now it's possible
to compare the strdup()/malloc() and free() counts and verify they match.
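In other words, the counting rule becomes something like this sketch
(counter name illustrative, in the spirit of the mem stats wrappers):

  #define free(x) ({                                        \
          void *__x = (x);                                  \
          if (__x)                                          \
                  _HA_ATOMIC_ADD(&mem_stats_free_calls, 1); \
          free(__x); /* not re-expanded: plain libc free */ \
  })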
It's more convenient to return the element than to return just 0 or 1,
as the next thing we'll want to do is to act on this element! In addition
it was using variable arguments instead of consts, causing some reuse
constraints which were also addressed. This doesn't change its use as
a boolean, hence why call places were not modified.
The "weight" column on the stats page is somewhat confusing when using
slowstart because it reports the effective weight, without being really
explicit about it. In some situations the user-configured weight is more
relevant (especially with long slowstarts where it's important to know
if the configured weight is correct).
This adds a new uweight stat which reports a server's user-configured
weight, and in a backend it receives the sum of all servers' uweights.
In addition it adds the mention of "effective" in a few descriptions
for the "weight" column (help and doc).
As a result, the list of servers in a backend is now always scanned
when dumping the stats. But this is not a problem given that these
servers are already scanned anyway and for way heavier processing.
"no-maint" is a bit similar to "up" except that it will only hide
servers that are in maintenance (or disabled in the configuration), and
not those that are enabled but failed a check. One benefit here is to
significantly reduce the output of the "show stat" command when using
large server-templates containing entries that are not yet provisioned.
Note that the prometheus exporter also has such an option which does
the exact same.
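Typical use from the CLI would look like this:

  $ echo "show stat no-maint" | socat stdio /var/run/haproxy.sock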
This reverts commit b7ba1d9011. Actually
this test had already been removed in the past by commit fac0f645d
("BUG/MEDIUM: queue: make pendconn_cond_unlink() really thread-safe"),
but the condition to reproduce the bug mentioned there was not clear.
Now after analysis and a certain dose of code cleanup, things start to
appear more obvious. What happens is that if we check the presence of
the node in the tree without taking the lock, we can see the NULL at
the instant the node is being unlinked by another thread in
pendconn_process_next_strm() as part of __pendconn_unlink_prx() or
__pendconn_unlink_srv(). Till now there is no issue except that the
pendconn is not removed from the queue during this operation and that
the task is scheduled to be woken up by pendconn_process_next_strm()
with the stream being added to the list of the server's active
connections by __stream_add_srv_conn(). The first thread finishes
faster and gets back to stream_free() faster than the second one
sets the srv_conn on the stream, so stream_free() skips the s->srv_conn
test and doesn't try to dequeue the freshly queued entry. At the
very least a barrier would be needed there but we can't afford to
free the stream while it's being queued. So there's no other solution
than making sure that either __pendconn_unlink_prx() or
pendconn_cond_unlink() get the entry but never both, which is why the
lock is required around the test. A possible solution would be to set
p->target before unlinking the entry and using it to complete the test.
This would leave no dead period where the pendconn is not seen as
attached.
It is possible, yet extremely difficult, to reproduce this bug, which
was first noticed in bug #880. Running 100 servers with maxconn 1 and
maxqueue 1 on leastconn and a connect timeout of 30ms under 16 threads
with DEBUG_UAF, with a traffic making the backend's queue oscillate
around zero (typically using 250 connections with a local httpterm
server) may rarely manage to trigger a use-after-free.
No backport is needed.
On connection error processing, we can see massive storms of calls to
pendconn_cond_unlink() to release a possible place in the queue. For
example, in issue #908, on average half of the threads are caught in
this function via back_try_conn_req() consecutive to a synchronous
error. However we wait until grabbing the lock to know if the pendconn
is effectively in a queue, which is expensive for many cases. We know
the transition may only happen from in-queue to out-of-queue so it's safe
to first run a preliminary check to see if it's worth going further. This
will allow us to avoid the cost of locking for most requests. This should
not change anything for those completing correctly as they're already
run through pendconn_free() which doesn't call pendconn_cond_unlink()
unless deemed necessary.
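The pattern can be sketched as follows (lock label and field names are
assumptions):

  void pendconn_cond_unlink(struct pendconn *p)
  {
          /* lock-free pre-check: the transition is monotonic, a NULL
           * leaf_p can never become non-NULL again for this entry */
          if (!p || !p->node.node.leaf_p)
                  return;

          HA_RWLOCK_WRLOCK(QUEUE_LOCK, &p->px->lock);
          if (p->node.node.leaf_p)    /* recheck under the lock */
                  __pendconn_unlink(p);
          HA_RWLOCK_WRUNLOCK(QUEUE_LOCK, &p->px->lock);
  }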
This is an anticipation of finer grained locking for the queues. For now
all lock places take a write lock so that there is no difference at all
with previous code.
In addition to the previous simplification, most locks don't use the
seek or read lock (e.g. spinlocks etc) so let's split the dump into
distinct operations (write/seek/read) and only report those which
were used. Now the output size is roughly divided by 5 compared
to previous ones.
The lock stats are very verbose and more than half of them are used in
a typical test, making it hard to spot the sought values. Let's simply
report "not used" for those which have not been called at all.
When an internal response is returned to a client, the message payload must be
skipped if it is a reply to a HEAD request. The payload is removed from the HTX
message just before the message forwarding.
This bug has been around for a long time. It was already there in the pre-HTX
versions. In legacy HTTP mode, internal errors are not parsed. So this bug
cannot be easily fixed. Thus, this patch should only be backported in all HTX
versions, as far as 2.0. However, the code has significantly changed in the
2.2. Thus in the 2.1 and 2.0, the patch must be entirely reworked.
Partial support of conditional HTTP requests. This commit adds the
support of the 'If-None-Match' header (see RFC 7232#3.2).
When a client specifies a list of ETags through one or more
'If-None-Match' headers, they are all compared to the one that might have
been stored in the corresponding http cache entry until one of them
matches.
If a match happens, a specific "304 Not Modified" response is
sent instead of the cached data. This response has all the stored
headers but no other data (see RFC 7232#4.1). Otherwise, the whole cached data
is sent.
Although unlikely in a GET/HEAD request, the "If-None-Match: *" syntax is
valid and also receives a "304 Not Modified" response (RFC 7234#4.3.2).
This resolves a part of GitHub issue #821.
Add a function that compares two etags that might be of different types.
If any of them is weak, the 'W/' prefix is discarded and a strict string
comparison is performed.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
In issue #785, users are reporting that it's not convenient to load a
".crt.key" when the configuration contains a ".crt".
This option allows removing the extension of the certificate before
trying to load any extra SSL file (.key, .ocsp, .sctl, .issuer etc.)
The patch slightly changes the way ssl_sock_load_files_into_ckch()
looks for the file.
When a connection is released, depending on its state, it may be detached from
the session and it may be removed from the server lists. The first case may
happen for private or unsharable active connections. The second one should only
be performed for idle or available connections. We never try to remove a
connection from the server list if it is attached to a session. But it is also
important to never try to remove a private connection from the server lists, even
if it is not attached to a session. Otherwise, the curr_used_conn server counter
is decremented once too often.
This bug was introduced by the commit 04a24c5ea ("MINOR: connection: don't check
priv flag on free"). It is related to the issue #881. It only affects the 2.3,
no backport is needed.
This counter is only updated and never used, and in addition it's done
without any atomicity so it's very unlikely to be correct on multi-CPU
systems! Let's just remove it since it's not used.
A few structures were slightly rearranged in order to plug some holes
left around the locks. Sizes ranging from 8 to 32 bytes could be saved
depending on the structures. No performance difference was noticed (none
was expected there), though memory usage might be slightly reduced in
some rare cases.
We don't need to waste the size of a long for the locks: with the plocks,
even an unsigned short would offer enough room for up to 126 threads! Let's
use an unsigned int which will be easier to place in certain structures
and will more conveniently plug some holes, and atomic ops are at least
as fast on 32-bit as on 64-bit. This will not change anything for 32-bit
platforms.
It's a bit overkill to register an initcall to call a function to set
a lock to zero when not debugging, let's just declare the lock as
pre-initialized to zero.
BoringSSL is a fork of OpenSSL 1.1.0, however in
49e9f67d8b7cbeb3953b5548ad1009d15947a523 it has changed version to 1.1.1.
Should fix issue #895.
This must be backported to 2.2, 2.1, 2.0, 1.8
It was previously a spinlock, and it happens that a number of LB algos
only lock it for lookups, without performing any modification. Let's
first turn it to an rwlock and w-lock it everywhere. This is strictly
identical.
It was carefully checked that every HA_SPIN_LOCK() was turned to
HA_RWLOCK_WRLOCK() and that HA_SPIN_UNLOCK() was turned to
HA_RWLOCK_WRUNLOCK() on this lock. _INIT and _DESTROY were updated too.
Since our locks are based on progressive locks, we support the upgradable
seek lock that is compatible with readers and upgradable to a write lock.
The main purpose is to take it while seeking down a tree for modification
while other threads may seek the same tree for an input (e.g. compute the
next event date).
The newly supported operations are:
HA_RWLOCK_SKLOCK(lbl,l) pl_take_s(l) /* N --> S */
HA_RWLOCK_SKTOWR(lbl,l) pl_stow(l) /* S --> W */
HA_RWLOCK_WRTOSK(lbl,l) pl_wtos(l) /* W --> S */
HA_RWLOCK_SKTORD(lbl,l) pl_stor(l) /* S --> R */
HA_RWLOCK_WRTORD(lbl,l) pl_wtor(l) /* W --> R */
HA_RWLOCK_SKUNLOCK(lbl,l) pl_drop_s(l) /* S --> N */
HA_RWLOCK_TRYSKLOCK(lbl,l) (!pl_try_s(l)) /* N -?> S */
HA_RWLOCK_TRYRDTOSK(lbl,l) (!pl_try_rtos(l)) /* R -?> S */
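For illustration, a tree update could then be sketched as:

  /* seek concurrently with readers, upgrade only for the insertion
   * (lock label and helpers below are made up) */
  HA_RWLOCK_SKLOCK(OTHER_LOCK, &wq_lock);           /* N --> S */
  node = find_insert_position(&wait_queue, key);
  HA_RWLOCK_SKTOWR(OTHER_LOCK, &wq_lock);           /* S --> W */
  insert_node_at(node, &item->node);
  HA_RWLOCK_WRUNLOCK(OTHER_LOCK, &wq_lock);         /* W --> N */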
Existing code paths are left unaffected so this patch doesn't affect
any running code.
We currently use only read and write lock operations with rwlocks, but
ours also support upgradable seek locks for which we do not report any
stats. Let's add them now when DEBUG_THREAD is enabled.
We don't need to specify the handler anymore since it's set in the
receiver. Let's remove this argument from the function and clean up
the remains of code that were still setting it.
Now we define a new sock_accept_iocb() for socket-based stream protocols
and use it as a wrapper for listener_accept() which now takes a listener
and not an FD anymore. This will allow the receiver's I/O cb to be
redefined during registration, and more specifically to get rid of the
hard-coded hacks in protocol_bind_all() made for syslog.
The previous ->accept() callback in the protocol was removed since it
doesn't have anything to do with accept() anymore but is more generic.
A few places where listener_accept() was compared against the FD's IO
callback for debugging purposes on the CLI were updated.
For now we're still using the protocol's default accept() function as
the I/O callback registered by the receiver into the poller. While
this is usable for most TCP connections where a listener is needed,
this is not suitable for UDP where a different handler is needed.
Let's make this configurable in the receiver just like the upper layer
is configurable for listeners. In order to ease stream protocols
handling, the protocols will now provide a default I/O callback
which will be preset into the receivers upon allocation so that
almost none of them has to deal with it.
The socket-specific accept() code in listener_accept() has nothing to
do there. Let's move it to sock.c where it can be significantly cleaned
up. It will now directly return an accepted connection and provide a
status code instead of letting listener_accept() deal with various errno
values. Note that this doesn't support the sockpair specific code.
The function is now responsible for dealing with its own receiver's
polling state and calling fd_cant_recv() when facing EAGAIN.
One tiny change from the previous implementation is that the connection's
sockaddr is now allocated before trying accept(), which saves a memcpy()
of the resulting address for each accept at the expense of a cheap
pool_alloc/pool_free on the final accept returning EAGAIN. This still
apparently slightly improves accept performance in microbenchmarks.
This per-protocol function will be used to accept an incoming
connection and return it as a struct connection*. As such the protocol
stack's internal representation of a connection will not need to be
handled by the listener code.
This call was introduced by commit 5ced3e887 ("MINOR: sock: add
sock_accept_conn() to test a listening socket") but is actually quite
confusing because it makes one think the socket will accept a connection
(which is what we want to have in a new function) while it only tells
whether it's configured to accept connections. Let's call it
sock_accepting_conn() instead.
The same change was applied to sockpair which had the same issue.
accept_conn() will be used to accept an incoming connection and return it.
It will have to deal with various error codes. The currently identified
ones were created as CO_AC_*.
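A sketch of the callback and its statuses (only the CO_AC_ prefix is
given above; the exact names below are assumptions):

  enum {
          CO_AC_DONE,     /* a connection was accepted           */
          CO_AC_RETRY,    /* transient error, try again (EINTR)  */
          CO_AC_YIELD,    /* nothing ready yet (EAGAIN)          */
          CO_AC_PAUSE,    /* resource shortage, stop for a while */
          CO_AC_PERMERR,  /* permanent error on this listener    */
  };

  struct connection *(*accept_conn)(struct listener *l, int *status);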
Till now we would keep a per-thread queue of pending incoming connections
for which we would store:
- the listener
- the accepted FD
- the source address
- the source address' length
And these elements were first used in session_accept_fd() running on the
target thread to allocate a connection and duplicate them again. Doing
this induces various problems. The first one is that session_accept_fd()
may only run on file descriptors and cannot be reused for QUIC. The second
issue is that it induces lots of memory copies and that the listener
queue thrashes a lot of cache, consuming 64 bytes per entry.
This patch changes this by allocating the connection before queueing it,
and by only placing the connection's pointer into the queue. Indeed, the
first two calls used to initialize the connection already store all the
information above, which can be retrieved from the connection pointer
alone. So we just have to pop one pointer from the target thread, and
pass it to session_accept_fd() which only needs the FD for the final
settings.
This starts to make the accept path a bit more transport-agnostic, and
saves memory and CPU cycles at the same time (1% connection rate increase
was noticed with 4 threads). Thanks to reducing the accept-queue entry
size from 64 to 8 bytes, the queue depth could be increased from 256 to
1024 connections while still halving the overall size. No single
queue full condition was met.
One minor drawback is that connection may be allocated from one thread's
pool to be used into another one. But this already happens a lot with
connection reuse so there is really nothing new here.
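Schematically, each ring entry is now a single pointer, and everything
session_accept_fd() needs is already reachable from it (a sketch, using
the usual connection fields):

  struct connection *accept_queue_ring[ACCEPT_QUEUE_SIZE]; /* 8B/entry */

  int fd                       = conn->handle.fd;   /* accepted FD    */
  struct sockaddr_storage *src = conn->src;         /* source address */
  struct listener *li          = objt_listener(conn->target);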
Roughly half of the calls to sockaddr_alloc() are made to copy an already
known address. Let's optionally pass it in argument so that the function
can handle the copy at the same time, this slightly simplifies its usage.
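The extended helper can be sketched as follows (consistent with the
description above; the exact signature is an assumption):

  struct sockaddr_storage *sockaddr_alloc(struct sockaddr_storage **sap,
                                          const struct sockaddr_storage *orig,
                                          socklen_t len)
  {
          struct sockaddr_storage *sa = pool_alloc(pool_head_sockaddr);

          if (sa && orig)
                  memcpy(sa, orig, len);  /* optional copy in one step */
          if (sap)
                  *sap = sa;
          return sa;
  }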
fd_done_recv() used to be useful with the FD cache because it used to
allow to keep a file descriptor active in the poller without being
marked as ready in the cache, saving it from ringing immediately,
without incurring any system call. It was a way to make it yield
to wait for new events leaving a bit of time for others. The only
user left was the connection accepter (listener_accept()). We used
to suspect that with the FD cache removal it had become totally
useless since changing its readiness or not wouldn't change its
status regarding the poller itself, which would be the only one
deciding to report it again.
Careful tests showed that it indeed has exactly zero effect nowadays,
the syscall numbers are exactly the same with and without, including
when enabling edge-triggered polling.
Given that there's no more API available to manipulate it and that it
was directly called as an optimization from listener_accept(), it's
about time to remove it.
No protocol defines it anymore. The last user used to be the monitor-net
stuff that got partially broken already when the tcp_drain() function
moved to conn_sock_drain() with commit e215bba95 ("MINOR: connection:
make conn_sock_drain() work for all socket families") in 1.9-dev2.
A part of this will surely move back later when non-socket connections
arrive with QUIC but better keep the API clean and implement what's
needed in time instead.
As discussed here during 2.1-dev, "monitor-net" is totally obsolete:
https://www.mail-archive.com/haproxy@formilux.org/msg35204.html
It's fundamentally incompatible with usage of SSL, and imposes the
presence of file descriptors with hard-coded syscalls directly in the
generic accept path.
It's very unlikely that anyone has used it in the last 10 years for
anything beyond testing. In the worst case if anyone would depend
on it, replacing it with "http-request return status 200 if ..." and
"mode http" would certainly do the trick.
The keyword is still detected as special by the config parser to help
users update their configurations appropriately.
As discussed here during 2.1-dev, "mode health" is totally obsolete:
https://www.mail-archive.com/haproxy@formilux.org/msg35204.html
It's fundamentally incompatible with usage of SSL, doesn't support
source filtering, and imposes the presence of file descriptors with
hard-coded syscalls directly in the generic accept path.
It's very unlikely that anyone has used it in the last 10 years for
anything beyond testing. In the worst case if anyone would depend
on it, replacing it with "http-request return status 200" and "mode
http" would certainly do the trick.
The keyword is still detected as special by the config parser to help
users update their configurations appropriately.
Do not check CO_FL_PRIVATE flag to check if the connection is in session
list on conn_free. This is necessary due to the future patches which add
server connections in the session list even if not private, if the mux
protocol is the subject of HOL blocking.
On server connection migration from one thread to another, the wrong
idle thread-specific counter is decremented. This bug was introduced
by commit 3d52f0f1f8 due to the
factorization with srv_use_idle_conn. However, this statement is only
executed from conn_backend_get. Extract the decrement from
srv_use_idle_conn in conn_backend_get and use the correct
thread-specific counter.
Rename the function to srv_use_conn to better reflect its purpose as it
is also used with a newly initialized connection not in the idle list.
As a side change, the connection insertion to available list has also
been extracted to conn_backend_get. This will be useful to be able to
specify an alternative list for protocol subject to HOL risk that should
not be shared between several clients.
This bug is only present in this release and thus does not need a backport.
Now we introduce a new .rx_listening() function to report if a receiver is
actually a listening socket. The reason for this is to help detect shared
sockets that might have been broken by sibling processes.
At several places we need to check if a socket is still valid and still
willing to accept connections. Instead of open-coding this, each time,
let's add a new function for this.
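On plain sockets the natural implementation relies on SO_ACCEPTCONN
(a sketch):

  /* returns <0 on failure, 0 if not listening, >0 if listening */
  int sock_accepting_conn(const struct receiver *rx)
  {
          int val = 0;
          socklen_t len = sizeof(val);

          if (getsockopt(rx->fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) < 0)
                  return -1;
          return val;
  }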
This patch adds "coll" new counter and the heartbeat timer values to "show peers"
command. It also adds the elapsed time since the last handshake to new "last_hdshk"
new peer dump field.
Now we have ->suspend() and ->resume() for listeners at the protocol
level. This means that it now becomes possible for a protocol to redefine
its own way to suspend and resume. The default functions are provided for
TCP, UDP and unix, and they are pass-through to the receiver equivalent
as it used to be till now. Nothing was defined for sockpair since it does
not need to suspend/resume during reloads, hence it will succeed.
The inner part now goes into the protocol and is used to decide how to
unbind a given protocol's listener. The existing code which is able to
also unbind the receiver was provided as a default function that we
currently use everywhere. Some complex listeners like QUIC will use this
to decide how to unbind without impacting existing connections, possibly
by setting up other incoming paths for the traffic.
This is used as a generic way to unbind a receiver at the end of
do_unbind_listener(). This allows to considerably simplify that function
since we can now let the protocol perform the cleanup. The generic code
was moved to sock.c, along with the conditional rx_disable() call. Now
the code also supports having the protocol's ->disable() function,
which acts on the listener, perform the close itself and adjust the
RX_F_BOUND flag accordingly.
This listener flag indicates whether the receiver part of the listener
is specific to the master or to the workers. In practice it's only used
by the master's CLI right now. It's used to know whether or not the FD
must be closed before forking the workers. For this reason it's way more
of a receiver's property than a listener's property, so let's move it
there under the name RX_F_MWORKER. The rest of the code remains
unchanged.
And also remove it from its callers. This subtle distinction was added as
sort of a hack for the seamless reload feature but is not needed anymore
since the do_close argument became unused with the previous commit
("MEDIUM: listener: let do_unbind_listener() decide whether to close or
not"). This also removes the unbind_listener_no_close() function.
This will instantly stop all listeners except those which belong to
a proxy configured with a grace time. This means that UDP listeners
and peers will also be stopped when called this way.
There are multiple ways a proxy may switch to the disabled state,
but now it's essentially once it loses its last listener. Instead
of keeping duplicate code around and reporting the state change
before actually seeing it, we now report it at the moment it's
performed (from the last listener leaving) which allows to remove
the message from all other places.
For now we cannot easily distinguish a peers frontend from another one,
which will be problematic to avoid reporting them when stopping their
listeners. Let's add PR_MODE_PEERS for this. It's not supposed to cause
any issue since all non-HTTP proxies are handled similarly now.
This function will be used to definitely stop a listener (e.g. during a
soft_stop). This is actually tricky because it may be called for a proxy
or for a protocol, both of which require locks and already hold some. The
function takes booleans indicating which ones are already held, hoping
this will be enough. It's not well defined whether proto->disable() and
proto->rx_disable() are supposed to be called with any lock held, and
they are used from do_unbind_listener() with all these locks. Some back
annotations ought to be added on this point.
The proxy's listeners count is updated, and the proxy is marked as
disabled and woken up after the last one is gone. Note that a
listener in listen state is already not attached anymore since it
was disabled.
We'll need an already locked variant of this function so let's make
__delete_listener() which will be called with the protocol lock held
and the listener's lock held.
These methods will be used to enable/disable accepting new connections
so that listeners do not play with FDs directly anymore. Since all the
currently supported protocols work on socket for now, these are identical
to the rx_enable/rx_disable functions. However they were not defined in
sock.c since it's likely that some will quickly start to differ. At the
moment they're not used.
We have to take care of fd_updt before calling fd_{want,stop}_recv()
because it's allocated fairly late in the boot process and some such
functions may be called very early (e.g. to stop a disabled frontend's
listeners).
These methods will be used to enable/disable rx at the receiver level so
that callers don't play with FDs directly anymore. All our protocols use
the generic ones from sock.c at the moment. For now they're not used.
These will be used on receivers, to enable or disable receiving on a
listener, which most of the time just consists in enabling/disabling
the file descriptor.
We have to take care of the existence of fd_updt to know if we may
or not call fd_{want,stop}_recv() since it's not permitted in very
early boot.
This one undoes ->rx_suspend(), it tries to restore an operational socket.
It was only implemented for TCP since it's the only one we support right
now.
The ->pause method is inappropriate since it doesn't exactly "pause" a
listener but rather temporarily disables it so that it's not visible at
all to let another process take its place. The term "suspend" is more
suitable, since the "pause" is actually what we'll need to apply to the
FULL and LIMITED states which really need to make a pause in the accept
process. And it goes well with the use of the "resume" function that
will also need to be made per-protocol.
Let's rename the function and make it act on the receiver since it's
already what it essentially does, hence the prefix "_rx" to make it
more explicit.
The protocol struct was a bit reordered because it was becoming a real
mess between the parts related to the listeners and those for the
receivers.
Since the listeners were split into receiver+listener, this field ought
to have been renamed because it's confusing. It really links receivers
and not listeners, as most of the time it's used via rx.proto_list!
The nb_listeners field was updated accordingly.
These ones have never been called, they were referenced by the protocol's
disable_all for some protocols but there are no traces of their use, so
in addition to not being sure the code works, it has never been tested.
Let's remove a bit of complexity starting from there.
These two functions are used to pause and resume all listeners of
all protocols. They use the standard listener functions for this
so they're supposed to handle the situation gracefully regardless
of the upper proxies' states, and they will report completion on
proxies once the switch is performed.
It might be nice to define a particular "failed" state for listeners
that cannot resume and to count them on proxies in order to mention
that they're definitely stuck. On the other hand, the current
situation is retryable which is quite appreciable as well.
The two functions don't need to be distinguished anymore since they have
all the necessary info to act as needed on their listeners. Let's just
pass via stop_proxy() and make it check for each listener which one to
close or not.
Its sole remaining purpose was to display "proxy foo started", which
has little benefit and pollutes output for those with plenty of proxies.
Let's remove it now.
The VTCs were updated to reflect this, because many of them had explicit
counts of dropped lines to match this message.
This is tagged as MEDIUM because some users may be surprised by the
loss of this quite old message.
The remaining proxy states were only used to distinguish an enabled
proxy from a disabled one. Due to the initialization order, both
PR_STNEW and PR_STREADY were equivalent after startup, and they
would only differ from PR_STSTOPPED when the proxy is disabled or
shutdown (which is effectively another way to disable it).
Now we just have a "disabled" field which allows to distinguish them.
It's becoming obvious that start_proxies() is only used to print a
greeting message now, that we'd rather get rid of. Probably that
zombify_proxy() and stop_proxy() should be merged once their
differences move to the right place.
The enabled/disabled config options were stored into a "state" field
that is an integer but contained only PR_STNEW or PR_STSTOPPED, which
is a bit confusing, and causes a dependency with proxies. This was
renamed to "disabled" and is used as a boolean. The field was also
moved to the end of the struct to stop creating a hole and fill another
one.
This state was used to mention that a proxy was in PAUSED state, as opposed
to the READY state. This was causing some trouble because if a listener
failed to resume (e.g. because its port was temporarily in use during the
resume), it was not possible to retry the operation later. Now by checking
the number of READY or PAUSED listeners instead, we can accurately know if
something went bad and try to fix it again later. The case of the temporary
port conflict during resume now works well:
$ socat readline /tmp/sock1
prompt
> disable frontend testme3
> disable frontend testme3
All sockets are already disabled.
> enable frontend testme3
Failed to resume frontend, check logs for precise cause (port conflict?).
> enable frontend testme3
> enable frontend testme3
All sockets are already enabled.
This state is only set when a pause() fails but isn't even set when a
resume() fails. And we cannot recover from this state. Instead, let's
just count remaining ready listeners to decide to emit an error or not.
It's more accurate and will better support new attempts if needed.
Since v1.4 or so, it's almost not possible anymore to set this state. The
only exception is by using the CLI to change a frontend's maxconn setting
below its current usage. This case makes no sense, and for other cases it
doesn't make sense either because "full" is a vague concept when only
certain listeners are full and not all. Let's just remove this unused
state and make it clear that it's not reported. The "ready" or "open"
states will continue to be reported without being misleading as they
will be opposed to "stop".
The proxy state tries to be synthetic but that doesn't work well with
many listeners, especially for transition phases or after a failed
pause/resume.
In order to address this, we'll instead rely on counters of listeners in
a given state for the 3 major states (ready, paused, listen) and a total
counter. We'll now be able to determine a proxy's state by comparing these
counters only.
This function is used as a wrapper to set a listener's state everywhere.
We'll use it later to maintain some counters in a consistent state when
switching state so it's capital that all state changes go through it.
No functional change was made beyond calling the wrapper.
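A sketch of the wrapper, maintaining the counters mentioned above
(counter field names are assumptions):

  void listener_set_state(struct listener *l, enum li_state st)
  {
          struct proxy *px = l->bind_conf ? l->bind_conf->frontend : NULL;

          if (px) {
                  if (l->state == LI_READY)  px->li_ready--;
                  if (l->state == LI_PAUSED) px->li_paused--;
                  if (st == LI_READY)        px->li_ready++;
                  if (st == LI_PAUSED)       px->li_paused++;
          }
          l->state = st;
  }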
The zombie state is not used anymore by the listeners, because in the
last two cases where it was tested it couldn't match as it was covered
by the test on the process mask. Instead now the FD is either in the
LISTEN state or the INIT state. This also avoids forcing the listener
to be single-dimensional because actually belonging to another process
isn't totally exclusive with the other states, which explains some of
the difficulties requiring to check the proc_mask and the fd sometimes.
So let's get rid of it now not to be tempted to reuse it.
The doc on the listeners state was updated.
Use the new stats module API to integrate the dns counters in the
standard stats. This is done in order to avoid code duplication, keep
the code related to cli out of dns and use the full possibility of the
stats function, allowing to print dns stats in csv or json format.
Add a boolean 'clearable' on stats module structure. If set, it forces
all the counters to be reset on 'clear counters' cli command. If not,
the counters are reset only when 'clear counters all' is used.
This is executed on startup with the registered statistics module. The
existing statistics have been merged in a list containing all
statistics for each domain. This is useful to print all available
statistics in a generic way.
Allocate extra counters for all proxies/servers/listeners instances.
These counters are allocated with the counters from the stats modules
registered on startup.
Implement a small API to easily add extra counters inside a structure
instance. This will be used to implement dynamic statistics linked on
every type of object as needed.
The counters are stored in a dynamic array inside the relevant objects.
A stat module can be registered to quickly add new statistics to
haproxy. It must be attached to one of the available stats domains. The
registration must be done using INITCALL at the STG_REGISTER stage.
The stat module has a name which should be unique for each new module in
a domain. It also contains a statistics list with their name/desc and a
pointer to a function used to fill the stats from the module counters.
The module also provides the initial counter values used for the
automatically allocated counters. The offsets of these counters
are stored in the module structure.
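Registration could then be sketched like this (using a hypothetical dns
module; field names are assumptions):

  static struct stats_module dns_stats_module = {
          .name          = "dns",
          .fill_stats    = dns_fill_stats,   /* fills from counters  */
          .stats         = dns_stats,        /* name/desc per metric */
          .stats_count   = DNS_STATS_COUNT,
          .counters      = &dns_counters,    /* initial values       */
          .counters_size = sizeof(dns_counters),
          .clearable     = 1,            /* reset on "clear counters" */
  };

  INITCALL1(STG_REGISTER, stats_register_module, &dns_stats_module);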
This flag can be used to determine on what type of proxy object the
statistics should be relevant. It will be useful when adding dynamic
statistics. Currently, this flag is not used.
The domain option will be used to have statistics attached to other
objects than proxies/listeners/servers. At the moment, only the PROXY
domain is available.
Add an argument 'domain' on the 'show stats' cli command to specify the
domain. Only 'domain proxy' is available now. If not specified, proxy
will be considered the default domain.
For HTML output, only proxy statistics will be displayed.
Use an opaque pointer to store proxy instance. Regroup server/listener
as a single opaque pointer. This has the benefit to render the structure
more evolutive to support statistics on other types of objects in the
future.
This patch is needed to extend stat support for components other than
proxies objects.
The prometheus module has been adapted for these changes.
Render the stats size parametric in csv/json dump functions. This is
needed for the future patch which provides dynamic stats. For now the
static value ST_F_TOTAL_FIELDS is provided.
Remove unused parameter px on stats_dump_one_line.
This patch is needed to extend stat support to components other than
proxies objects.
Un-mark stats_dump_one_line and stats_putchk as static and export them
in the header file. These functions will be reusable by other components to
print their statistics.
This patch is needed to extend stat support to components other than
proxies objects.
A crash reported in github issue #880 looks impossible unless
pendconn_cond_unlink() occasionally sees a null leaf_p when attempting
to remove an entry, which seems to be confirmed by the reporter. What
seems to be happening is that depending on compiler optimizations,
this pointer can appear as null while pointers are moved if one of
the node's parents is removed from or inserted into the tree. There's
no explicit null of the pointer during these operations but those
pointers are rewritten in multiple steps and nothing prevents this
situation from happening, and there is no particular barrier nor
atomic op around this.
This test was used to avoid unnecessary locking, for already deleted
entries, but looking at the code it appears that pendconn_free() already
resets s->pend_pos that's used as <p> there, and that the other call
reasons are after an error where the connection will be dropped as
well. So we don't save anything by doing this test, and make it
unsafe. The older code used to check for list emptiness there and
not inside pendconn_unlink(), which explains why the code has stayed
there. Let's just remove this now.
Thanks to @jaroslawr for reporting this issue in great details and for
testing the proposed fix.
This should be backported to 1.8, where the test on LIST_ISEMPTY should
be moved to pendconn_unlink() instead (inside the lock, just like 2.0+).
Allow the syntax "${...[*]}" to expand an environment variable
containing several values separated by spaces as individual arguments. A
new flag PARSE_OPT_WORD_EXPAND has been added to toggle this feature on
parse_line invocation. In case of an invalid syntax, a new error
PARSE_ERR_WRONG_EXPAND will be triggered.
This feature was requested in github issue #165.
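For example (a sketch):

  $ export MYOPTS="reuseport transparent"

  # in the configuration, this expands into two distinct keywords:
  bind :8443 "${MYOPTS[*]}"
  # while "${MYOPTS}" would have yielded a single argument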
As part of his GREASE experiments on Chromium, Bence Béky reported in
https://lists.w3.org/Archives/Public/ietf-http-wg/2020JulSep/0202.html
and https://bugs.chromium.org/p/chromium/issues/detail?id=1127060 that
a certain combination of frame type and frame flags was causing an error
on app.slack.com. It turns out that it's haproxy that is causing this
issue because the frame type is wrongly assumed to support padding, the
frame flags indicate padding is present, and the frame is too short for
this, resulting in an error.
The reason why only some frame types are affected is due to the frame
type being used in a bit shift to match against a mask, and where the
5 lower bits of the frame type only are used to compute the frame bit.
If the resulting frame bit matches a DATA, HEADERS or PUSH_PROMISE frame
bit, then padding support is assumed and the test is enforced, resulting
in a PROTOCOL_ERROR or FRAME_SIZE_ERROR depending on the payload size.
We must never match any such bit for unsupported frame types so let's
add a check for this. This must be backported as far as 1.8.
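The fix can be sketched as follows (macro and bit names are
illustrative, not the actual ones):

  /* only types below 32 may map to a frame bit; greased/unknown types
   * must never match DATA, HEADERS or PUSH_PROMISE */
  #define H2_FT_BIT(ft)  ((ft) < 32 ? 1U << (ft) : 0)

  if (H2_FT_BIT(h2c->dft) &
      (H2_F_DATA_BIT | H2_F_HEADERS_BIT | H2_F_PUSH_PROMISE_BIT)) {
          /* padding checks only apply here */
  }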
Thanks to Cooper Bethea for providing enough context to help narrow the
issue down and to Bence Béky for creating a simple reproducer.
We don't need to cheat with the sock_domain anymore, we now always have
the SOCK_DGRAM sock_type as a complementary selector. This patch restores
the sock_domain to AF_INET* in the udp* protocols and removes all traces
of the now unused AF_CUST_*.
The protocol array used to be only indexed by socket family, which is very
problematic with UDP (requiring an extra family) and with the forthcoming
QUIC (also requiring an extra family), especially since that binds them to
certain families, prevents them from supporting dgram UNIX sockets etc.
In order to address this, we now start to register the protocols with more
info, namely the socket type and the control type (either stream or dgram).
This is sufficient for the protocols we have to deal with, but could also
be extended further if multiple protocol variants were needed. But as is,
it still fits nicely in an array, which is convenient for lookups that are
instant.
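Conceptually, the lookup becomes (a sketch; the actual array layout may
differ):

  /* indexed by family, socket type, and control type */
  static struct protocol *__protocol_by_family[AF_MAX][2][2];

  static inline struct protocol *
  protocol_lookup(int family, int sock_dgram, int ctrl_dgram)
  {
          return __protocol_by_family[family][!!sock_dgram][!!ctrl_dgram];
  }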
This one will be needed to more accurately select a protocol. It may
differ from the socket type for QUIC, which uses dgram at the socket
layer and provides stream at the control layer. The upper level requests
a control layer only so we need this field.
Most callers of str2sa_range() need the protocol only to check that it
provides a ->connect() method. It used to be used to verify that it's a
stream protocol, but it might be a bit early to get rid of it. Let's keep
the test for now but move it to str2sa_range() when the new flag PA_O_CONNECT
is present. This way almost all call places could be cleaned from this.
There's a strange test in the server address parsing code that rechecks
the family from the socket which seems to be a duplicate of the previously
removed tests. It will have to be rechecked.
We'll need this so that it can return pointers to stacked protocol in
the future (for QUIC). In addition this removes a lot of tests for
protocol validity in the callers.
Some of them were checked further apart, or after a call to
str2listener() and they were simplified as well.
There's still a trick, we can fail to return a protocol in case the caller
accepts an fqdn for use later. This is what servers do and in this case it
is valid to return no protocol. A typical example is:
server foo localhost:1111
The function will need to use more than just a family, let's pass it
the selected protocol. The caller will then be able to do all the fancy
stuff required to pick the best protocol.
This is at least temporary, as the migration at once is way too difficult.
For now it still creates listeners but only allows DGRAM sockets. This
aims at easing the split between listeners and receivers.
If a file descriptor was passed, we can optionally return it. This will
be useful for listening sockets which are both a pre-bound FD and a ready
socket.
These flags indicate whether the call is made to fill a bind or a server
line, or even just send/recv calls (like logs or dns). Some special cases
are made for outgoing FDs (e.g. pipes for logs) or socket FDs (e.g external
listeners), and there's a distinction between stream or dgram usage that's
expected to significantly help str2sa_range() proceed appropriately with
the input information. For now they are not used yet.
These flags indicate what is expected regarding port specifications. Some
callers accept none, some need fixed ports, some have it mandatory, some
support ranges, and some take an offset. Each possibility is reflected by
an option. For now they are not exploited, but the goal is to instrument
str2sa_range() to properly parse that.
We currently have an argument to require that the address is resolved
but we'll soon add more, so let's turn it into a bit field. The old
"resolve" boolean is now PA_O_RESOLVE.
At some places (log fd@XXX, bind fd@XXX) we support using an explicit
file descriptor number, that is placed into the sockaddr for later use.
The problem is that till now it was done with an AF_UNSPEC family, which
is also used for other situations like missing info or rings (for logs).
Let's create an "official" family AF_CUST_EXISTING_FD for this case so
that we are certain the FD can be found in the address when it is set.
This removes the following fields from struct protocol that are now
retrieved from the protocol family instead: .sock_family, .sock_addrlen,
.l3_addrlen, .addrcmp, .bind, .get_src, .get_dst.
This also removes the UDP-specific udp{,6}_get_{src,dst}() functions
which were referenced but not used yet. Their goal was only to remap
the original AF_INET* addresses to AF_CUST_UDP*.
Note that .sock_domain is still there as it's used as a selector for
the protocol struct to be used.
We now take care of retrieving sock_family, l3_addrlen, bind(),
addrcmp(), get_src() and get_dst() from the protocol family and
not just the protocol itself. There are very few places, this was
only seldom used. Interestingly, sock_inet.c used to rely on
->sock_family instead of ->sock_domain, and sock_unix.c used to
hard-code PF_UNIX instead of using ->sock_domain.
Also it appears obvious we have something wrong in the protocol
selection algorithm because sock_domain is the one set to the custom
protocols while it ought to be sock_family instead, which would avoid
having to hard-code some conversions for UDP namely.
We need to specially handle protocol families which regroup common
functions used for a given address family. These functions include
bind(), addrcmp(), get_src() and get_dst() for now. Some fields are
also added about the address family, socket domain (protocol family
passed to the socket() syscall), and address length.
These protocol families are referenced from the protocols but not yet
used.
Note that for now we don't have a sockpair.c file to host that unusual
family, so the new function was placed directly into proto_sockpair.c.
It's no big deal given that this family is currently not shared with
multiple protocols.
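The descriptor can be sketched as follows (prototypes are assumptions
based on the fields listed above):

  struct proto_fam {
          char name[PROTO_NAME_LEN];
          sa_family_t sock_family;    /* address family               */
          int sock_domain;            /* domain passed to socket()    */
          socklen_t sock_addrlen;     /* socklen for bind()/connect() */
          int l3_addrlen;             /* layer-3 address length       */
          int (*addrcmp)(const struct sockaddr_storage *,
                         const struct sockaddr_storage *);
          int (*bind)(struct receiver *rx, char **errmsg);
          int (*get_src)(int fd, struct sockaddr *sa, socklen_t salen, int dir);
          int (*get_dst)(int fd, struct sockaddr *sa, socklen_t salen, int dir);
  };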
The function does almost nothing but setting up the receiver. This is
normal as the socket the FDs are passed onto are supposed to have been
already created somewhere else, and the only usable identifier for such
a socket pair is the receiving FD itself.
The function was assigned to sockpair's ->bind() and is not used yet.
This function performs all the bind-related stuff for UNIX sockets that
was previously done in uxst_bind_listener(). There is a very tiny
difference however, which is that previously, in the unlikely event
where listen() would fail, it was still possible to roll back the binding
and rename the backup to the original socket. Now we have to rename it
before returning, hence it will be done before calling listen().
However, this doesn't cover any particular use case since listen() has no
reason to fail there (and the rollback is not done for inherited sockets),
that was just done that way as a generic error processing path.
The code is not used yet and is referenced in the uxst proto's ->bind().
This function collects all the receiver-specific code from both
tcp_bind_listener() and udp_bind_listener() in order to provide a more
generic AF_INET/AF_INET6 socket binding function. For now the API is
not very elegant because some info are still missing from the receiver
while there's no ideal place to fill them except when calling ->listen()
at the protocol level. It looks like some polishing code is needed in
check_config_validity() or somewhere around this in order to finalize
the receivers' setup. The main issue is that listeners and receivers
are created *before* bind_conf options are parsed and that there's no
finishing step to resolve some of them.
The function currently sets up a receiver and subscribes it to the
poller. In an ideal world we wouldn't subscribe it but let the caller
do it after having finished to configure the L4 stuff. The problem is
that the caller would then need to perform an fd_insert() call and to
possibly set the exported flag on the FD while it's not its job. Maybe
an improvement could be to have a separate sock_start_receiver() call
in sock.c.
For now the function is not used but it will soon be. It's already
referenced as tcp and udp's ->bind().
This will be the function that must be used to bind the receiver. It
solely depends on the address family but for now it's simpler to have
it per protocol.
The new RX_O_FOREIGN, RX_O_V6ONLY and RX_O_V4V6 options are now set into
the rx_settings part during the parsing, so that we don't need to adjust
them in each and every listener anymore. We have to keep both v4v6 and
v6only due to the precedence from v6only over v4v6.
It's the receiver's FD that's inherited from the parent process, not
the listener's so the flag must move to the receiver so that appropriate
actions can be taken.
In order to split the receiver from the listener, we'll need to know that
a socket is already bound and ready to receive. We used to do that via
the LI_O_ASSIGNED state but that's not sufficient anymore since the
receiver might not belong to a listener anymore. The new RX_F_BOUND flag
is used for this.
A receiver will have to pass a context to be installed into the fdtab
for use by the handler. We need to set this into the receiver struct
as the bind will happen long after the configuration is parsed.
Just like listeners keep a pointer to their bind_conf, receivers now also
have a pointer to their rx_settings. All those belonging to a listener are
automatically initialized with a pointer to the bind_conf's settings.
We'll soon add flags for the receivers, better add them to the final
file, so it's time to move the definition to receiver-t.h. The struct
receiver and rx_settings were placed there.
The receiver is the one which depends on the protocol while the listener
relies on the receiver. Let's move the protocol there. Since there's also
a list element to get back to the listener from the proto list, this list
element (proto_list) was moved as well. For now when scanning protos, we
still see listeners which are linked by their rx.proto_list part.
The listening socket is represented by its file descriptor, which is
generic to all receivers and not just listeners, so it must move to
the rx struct.
It's worth noting that in order to extend receivers and listeners to
other protocols such as QUIC, we'll need other handles than file
descriptors here, and that either a union or a cast to uintptr_t
will have to be used. This was not done yet and the field was
preserved under the name "fd" to avoid adding confusion.
In order to start to split the listeners into the listener part and the
event receiver part, we introduce a new field "rx" into struct listener
that will eventually become a separate struct receiver. This patch only
adds the struct with an options field that the receivers will need.
The netns is common to all listeners/receivers and is used to bind the
listening socket so it must be in the receiver settings and not in the
listener. This removes some yet another set of unnecessary loops.
The interface is common to all listeners/receivers and is used to bind
the listening socket so it must be in the receiver settings and not in
the listener. This removes some unnecessary loops.
There currently is a large inconsistency in how binding parameters are
split between bind_conf and listeners. It happens that for historical
reasons some parameters are available at the listener level but cannot
be configured per-listener but only for a bind_conf, and thus, need to
be replicated. In addition, some of the bind_conf parameters are in fact
for the listening socket itself while others are for the instantiated
sockets.
A previous attempt at splitting listeners into receivers failed because
the boundary between all these settings is not well defined.
This patch introduces a level of listening socket settings in the
bind_conf, that will be detachable later. Such settings that are solely
for the listening socket are:
- unix socket permissions (used only during binding)
- interface (used for binding)
- network namespace (used for binding)
- process mask and thread mask (used during startup)
The rest seems to be used only to initialize the resulting sockets, or
to control the accept rate. For now, only the unix params (bind_conf->ux)
were moved there.
Remove the last utility functions for handling the multi-cert bundles
and remove the multi-variable from the ckch structure.
With this patch, the bundles are completely removed.
Since the removal of the multi-certificates bundle support, this
variable is not useful anymore, we can remove all tests for this
variable and suppose that every ckch contains a single certificate.
Commit 4987a4744 ("CLEANUP: tree-wide: use VAR_ARRAY instead of [0] in
various definitions") broke the build on clang due to the tlv field used
to receive/send the proxy protocol. The problem is that struct tlv is
included at the beginning of struct tlv_ssl, which doesn't make much
sense. In fact the value[] array isn't really a var array but just an
end of struct marker, and must really be an array of size zero.
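The situation can be illustrated like this (structures simplified from
the proxy protocol definitions):

  struct tlv {
          uint8_t type;
          uint8_t length_hi, length_lo;
          uint8_t value[0];   /* end-of-struct marker, not a real
                               * variable array: tlv_ssl embeds it */
  };

  struct tlv_ssl {
          struct tlv tlv;     /* a true flexible array member here
                               * would make this embedding invalid */
          uint8_t  client;
          uint32_t verify;
          uint8_t  sub_tlv[VAR_ARRAY];
  };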
Surprisingly there were still a number of [0] definitions for variable
sized arrays in certain structures all over the code. We need to use
VAR_ARRAY instead of zero to accommodate various compilers' preferences,
as zero was used only on old ones and tends to report errors on new ones.
tcc supports variadic macros provided that there is always at least one
argument, like older gcc versions. Thus we need to always keep one and
define args as the remaining ones. It's not an issue at all and doesn't
change the way to use them, just the internal definitions.
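Concretely (a sketch with a made-up macro):

  /* breaks on tcc and older gcc when the argument list is empty: */
  #define trace(...)       do_trace(__VA_ARGS__)

  /* portable: one mandatory argument, the rest remain variadic */
  #define trace(fmt, ...)  do_trace(fmt, ##__VA_ARGS__)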
For whatever reason, glibc decided that the __attribute__ keyword is
the exclusive property of gcc, and redefines it to an empty macro on
other compilers. Some non-gcc compilers also support it (possibly
partially), tinycc is one of them. By doing this, glibc silently
broke all constructors, resulting in code that arrives in main() with
uninitialized variables.
The solution we use here consists in undefining the macro on non-gcc
compilers, and redefining it to itself in order to cause a conflict in
the event the redefinition would happen afterwards. This visibly solved
the problem.
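The workaround reads roughly as (sketch):

  #if !defined(__GNUC__)
  /* glibc may have defined __attribute__(x) to nothing; undo that and
   * make any later incompatible redefinition a visible conflict */
  #undef __attribute__
  #define __attribute__ __attribute__
  #endif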
Some checks on __GNUC__ imply that if it's undefined it will match a
low value but that's not always what we want, like for example in the
VAR_ARRAY definition which is not needed on tcc. Let's always be explicit
on these tests.
A SRV record keeps a reference on the corresponding additional record, if
any. But this additional record is also inserted in a separate linked-list into
the dns response. The problems arise when obsolete additional records are
released. The additional records list is purged but the SRV records always
reference these objects, leading to undefined behavior. Worse, this happens
very quickly because additional records are never renewed. Thus, once received,
an additional record will always expire.
Now, the additional records are only associated with a SRV record or
simply ignored, and the last version received is always used.
This patch helps to fix the issue #841. It must be backported to 2.2.
Ever since the protocols were added in 1.3.13, listeners used to be
started twice:
- once by start_proxies(), which iterates over all proxies then all
listeners ;
- once by protocol_bind_all() which iterates over all protocols then
all listeners ;
It's a real mess because error reporting is not even consistent, and
more importantly now that some protocols do not appear in regular
proxies (peers, logs), there is no way to retry their binding should
it fail on the last step.
What this patch does is to make sure that listeners are exclusively
started by protocols. The failure to start a listener now causes the
emission of an error indicating the proxy's name (as it used to be
the case per proxy), and retryable failures are silently ignored
during all but last attempts.
The start_proxies() function was kept solely for setting the proxy's
state to READY and emitting the "Proxy started" message and log that
some have likely got used to seeking in their logs.
When calling the http_replace_res_status() function, an optional reason may now
be set. It is ignored if it points to NULL and the original reason is
preserved. Only the response status is replaced. Otherwise both the status and
the reason are replaced.
It simplifies the API and most of time, avoids an extra call to
http_replace_res_reason().
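The resulting prototype and typical calls could look like this sketch
(IST_NULL is assumed available for the "keep the reason" case):

  int http_replace_res_status(struct htx *htx, const struct ist status,
                              const struct ist reason);

  /* replace the status only, keep the original reason: */
  http_replace_res_status(htx, ist("429"), IST_NULL);

  /* replace both at once, no extra http_replace_res_reason() call: */
  http_replace_res_status(htx, ist("429"), ist("Too Many Requests"));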
The http_replace_req_path() function now takes a third argument to evaluate the
query-string as part of the path or to preserve it. If <with_qs> is set, the
query-string is replaced with the path. Otherwise, only the path is replaced.
This patch is mandatory to fix issue #829. The next commit depends on it. So be
careful during backports.
For now we still don't retrieve dgram sockets, but the code must be able
to distinguish them before we switch to receivers. This adds a new flag
to the xfer_sock_list indicating that a socket is of type SOCK_DGRAM. The
way to set the flag for now is by looking at the dummy address family which
equals AF_CUST_UDP{4,6} in this case (given that other dgram sockets are not
yet supported).
We'll want to store more info there and some info that are not represented
in listener options at the moment (such as dgram vs stream) so let's get
rid of these and instead use a new set of options (SOCK_XFER_OPT_*).
The new function was called sock_get_old_sockets() and was left as-is
except a minimum amount of style lifting to make it more readable. It
will never be awesome anyway since it's used very early in the boot
sequence and needs to perform socket I/O without any external help.
This code was highly redundant, existing for TCP clients, TCP servers
and UDP servers. Let's move it to sock_inet where it belongs. The new
functions are sock_inet4_make_foreign() and sock_inet6_make_foreign().
This is essentially a merge from tcp_find_compatible_fd() and
uxst_find_compatible_fd() that relies on a listener's address and
compare function and still checks for other variations. For AF_INET6
it compares a few of the listener's bind options. A minor change for
UNIX sockets is that transparent mode, interface and namespace used
to be ignored when trying to pick a previous socket while now if they
are changed, the socket will not be reused. This could be refined but
it's still better this way as there is no more risk of using a
differently bound socket by accident.
Eventually we should not pass a listener there but a set of binding
parameters (address, interface, namespace etc...) which ultimately will
be grouped into a receiver. For now this still doesn't exist so let's
stick to the listener to break dependencies in the rest of the code.