QUIC will rely on UDP at the receiver level, and will need these functions
to suspend/resume the receivers. In the future, protocol chaining may
simplify this.
Currently conn_ctrl_init() does an fd_insert() and conn_ctrl_close() does an
fd_delete(). These are the only two short-term obstacles against using a
non-fd handle to set up a connection. Let's move these into the protocol
layer, along with the other connection-level stuff, so that the generic
connection code uses them instead. This will make it possible to define new
ones for other protocols (e.g. QUIC).
Since we only support regular sockets at the moment, the code was placed
into sock.c and shared with proto_tcp, proto_uxst and proto_sockpair.
For the sake of improved readability, let's group the protocol
field members according to where they're supposed to be defined:
- connection layer (note: for now even UDP needs one)
- binding layer
- address family
- socket layer
Nothing else was changed.
The various protocols were made static since there was no point in
exporting them in the past. Nowadays with QUIC relying on UDP we'll
significantly benefit from UDP being exported and more generally from
being able to declare some functions as being the same as other
protocols'.
In an ideal world it is not these protocols which should be
exported, but the intermediary levels:
- socket layer (sock.c only right now): already exported as functions
  but nothing structured at the moment;
- family layer (sock_inet, sock_unix, sockpair etc): already structured
  and exported;
- binding layer (the part that relies on the receiver): currently fused
  within the protocol;
- connection layer (the part that manipulates connections): currently
  fused within the protocol;
- protocol (connection's control): shouldn't need to be exposed
  ultimately, once the elements above are easily sharable.
This field used to be needed before commit 2b5e0d8b6 ("MEDIUM: proto_udp:
replace last AF_CUST_UDP* with AF_INET*") as it was used as a protocol
entry selector. Since this commit it's always equal to the socket family's
value so it's entirely redundant. Let's remove it now to simplify the
protocol definition a little bit.
At the end of stream_new(), once the input buffer is transferred to the request
channel, it must not be used anymore. The previous patch (16df178b6 "BUG/MEDIUM:
stream: Xfer the input buffer to a fully created stream") was pushed too quickly.
No backport needed.
The input buffer passed as argument to create a new stream must not be
transferred when the request channel is initialized, because the channel
flags are not set at this stage. In addition, the API is a bit confusing
regarding the buffer owner when an error occurs. The caller remains the
owner, but reading the code it is not obvious.
So, first of all, to avoid any ambiguity, comments are added on the
calling chain to make it clear. The buffer owner is the caller if any error
occurs, and the ownership is transferred to the stream on success.
Then, to make things simple, the ownership is transferred at the end of
stream_new(), in case of success, and the input buffer is updated to point
to BUF_NULL. Thus, in all cases, if the caller tries to release it by
calling b_free() on it, it is not a problem. Of course, it remains the
caller's responsibility to release it on error.
The patch fixes a bug introduced by the commit 26256f86e ("MINOR: stream:
Pass an optional input buffer when a stream is created"). No backport is
needed.
Since HAProxy 2.3, OpenSSL 1.1.1 is a requirement for using a
multi-certificate bundle in the configuration. This patch emits a fatal
error when HAProxy tries to load a bundle with an older version of
OpenSSL.
This problem was encountered by a user in issue #990.
This must be backported in 2.3.
With the removal of the family-specific port setting, all protocols had
exactly the same implementation of ->add(). A generic one was created
with the name "default_add_listener" so that all other ones can now be
removed. The API was slightly adjusted so that the protocol and the
listener are passed instead of the listener and the port.
Note that all protocols continue to provide this ->add() method instead
of routinely calling default_add_listener() from create_listeners(). This
makes sure that any non-standard protocol will still be able to intercept
the listener addition if needed.
This could be backported to 2.3 along with the few previous patches on
listeners as a pure code cleanup.
In create_listeners() we iterate over a port range and call the
protocol's ->add() function to add a new listener on the specified
port. Only tcp4/tcp6/udp4/udp6 support a port, the other ones ignore
it. Now that we can rely on the address family to properly set the
port, better do it this way directly from create_listeners() and
remove the family-specific case from the protocol layer.
At various places we need to set a port on an IPv4 or IPv6 address, and
it requires casts that are easy to get wrong. Let's add a new set_port()
helper to the address family to assist in this. It will be directly
accessible from the protocol and will make the operation seamless.
Right now this is only implemented for sock_inet as other families do
not need a port.
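A minimal sketch of what such a helper may look like (name and signature
are illustrative, not necessarily the actual ones):

    #include <netinet/in.h>
    #include <sys/socket.h>

    /* set the port on an IPv4/IPv6 address; no-op for other families */
    static void set_port(struct sockaddr_storage *addr, int port)
    {
        if (addr->ss_family == AF_INET)
            ((struct sockaddr_in *)addr)->sin_port = htons(port);
        else if (addr->ss_family == AF_INET6)
            ((struct sockaddr_in6 *)addr)->sin6_port = htons(port);
    }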
When a H1 message is parsed, if the parser state is switched to TUNNEL mode
just after the header parsing, the BODYLESS flag is set on the HTX
start-line. By transitivity, the corresponding flag is set on the message in
HTTP analysers. Thus it is possible to rely on it to not wait for the
request body.
CNT_LEN and TE_CHNK flags must be set on the message only when the
corresponding flag is set on the HTX start-line. Before, when the transfer
length was known (XFER_LEN set), HTTP_MSGF_TE_CHNK was the default. But
it is not appropriate. Now, it is only set if the message is chunked. Thus,
it is now possible to have a known transfer length without CNT_LEN or
TE_CHNK.
In addition, the BODYLESS flag may be set independently of the XFER_LEN one.
This flag is now unused. It was used in the REQ_WAIT_HTTP analyser, when a
stream was waiting for a request, to set the keep-alive timeout or to avoid
sending HTTP errors to the client.
It is now impossible to start the HTTP request processing in the stream
analysers with a partial or empty request message. The mux-h2 was already
waiting for the request headers before creating the stream. Now the mux-h1
does the same. All errors (aborts, timeouts or invalid requests) occurring
while waiting for the request headers are now handled by the multiplexers.
So there is no
reason to still handle them in the REQ_WAIT_HTTP (http_wait_for_request)
analyser.
To ensure there is no ambiguity, a BUG_ON() was added to exit if a partial
request is received in this analyser.
H1C_F_CS_* flags are renamed into H1C_F_ST_*. They reflect the connection
state. So "ST" is well suited. "CS" is confusing because it is also the
abbreviation for conn-stream.
In addition, H1C flags are reordered.
This is the reason for all previous patches. The conn-stream and the
associated stream are created as late as possible. It only concerns
frontend connections. But it means the request headers, and possibly the
first data block, are received and parsed before the conn-stream
creation. To do so, an embryonic H1 stream, with no conn-stream, is
created. The result of this "early parsing" is stored in its rx buffer, used
to fill the request channel when the stream is created. During this step,
some HTTP errors may be returned by the mux. It must also handle
http-request/keep-alive timeouts. A significant change concerns the H1 to H2
upgrade: it happens very early now, and no H1 stream is created (and thus
of course no conn-stream).
The most important part of this patch is located in the h1_process()
function, because it must trigger the parsing when there is no H1
stream. The h1_recv() function has also been simplified.
For now, this part is unused. But this patch adds functions to handle errors
on idle and embryonic H1 connections and send corresponding HTTP error
messages to the client (400, 408 or 500). Thanks to previous patches, these
functions take care to update the right stats counters, but also the
counters tracked by the session.
A field to store the HTTP error code has been added in the H1C structure. It
is used for error retransmits, if any, and to get it in http logs. It is
used to return the mux exit status code when the MUX_EXIT_STATUS ctl
parameter is requested.
When a log message is emitted from the session level, by a multiplexer,
there is no stream. Thus for an HTTP session, there is no status code and
the termination flags are not correctly set.
Thanks to the previous patch, the HTTP status code is deduced from the mux
exit status, using the MUX_EXIT_STATUS ctl param. This is only done for HTTP
frontends. If it is defined (!= 0), it is used to deduce the termination
flags.
The ctl param MUX_EXIT_STATUS can be requested to get the exit status of a
multiplexer. For instance, it may be an HTTP status code or an H2 error. For
now, 0 is always returned. When the mux h1 becomes able to return HTTP
errors itself, this ctl param will be used to get the HTTP status code from
the logs.
The mux_exit_status enum has been created to map internal mux exit statuses
to generic ones. Thus there are 5 possible statuses for now: success, invalid
error, timeout error, internal error and unknown.
The cumulative numbers of http requests, http errors, bytes received and
sent and their respective rates for tracked counters are now updated using
specific stream independent functions. These functions are used by the
stream but the aim is to allow the session to do so too. For now, there is
no reason to perform these updates from the session, except from the mux-h2
maybe. But, the mux-h1, on the frontend side, will be able to return some
errors to the client, before the stream creation. In this case, it will be
mandatory to update counters tracked at the session level.
An idle expiration date is added on the H1 connection with the function to
set it depending on the connection state. First, there is no idle timeout on
backend connections. For idle frontend connections, the http-request or
keep-alive timeout is used depending on which timeout is defined and whether
it is the first request or not. For embryonic connections, the http-request
timeout is always used, if defined. For attached or shut down connections,
no idle timeout is applied.
For now the idle expiration date is never set and the h1_set_idle_expiration
function remains unused.
When the conn-stream is detached from a H1 connection, there is no reason to
subscribe for reads or process pending input data if the connection is not
idle, because it means a shutdown is pending.
Conditions to set a timeout on the H1C task have been simplified or at least
changed to rely on H1 connection flags. Now, the following rules are used:
* The shutdown timeout is applied on dead (not alive) or shut down
  connections.
* The client/server timeout is applied if there are still some pending
  outgoing data.
* The client timeout is applied on alive frontend connections with no
  conn-stream, i.e. on idle or embryonic frontend connections.
* For all other connections (backend or attached connections), no timeout
  is applied. For frontend or backend attached connections, the timeout is
  handled by the application layer. For idle backend connections, there is
  no timeout.
We now only rely on one flag to notify a shutdown. The shutdown is performed
at the connection level when there are no more pending outgoing data. So, it
means it is performed immediately if the output buffer is empty. Otherwise
it is deferred after the outgoing data are sent.
This simplifies the mux a bit because there is now only one flag to check.
Don't try to read more data if a parsing or a formatting error was reported
on the H1 stream. There is no reason to continue to process the messages for
the current connection in this case. If a parsing error occurs, it means the
input is invalid. If a formatting error occurs, it is an internal error and
it is probably safer to give up.
Mainly to make it easier to read. First of all, when a H1 connection is
still there, we check if the connection was stolen by another thread or
not. If yes, we release the task and leave. Then we check if the task is
expired or not. Only expired tasks are considered. Finally, if a conn-stream
is still attached to the connection (H1C_F_CS_ATTACHED flag set), we
return. Otherwise, the task and the H1 connection are released.
Be prepared to have a H1 connection in one of the following states:
* A H1 connection waiting for a new message with no H1 stream.
  The H1C_F_CS_IDLE flag is set.
* A H1 connection processing a new message with a H1 stream but no
  conn-stream attached. The H1C_F_CS_EMBRYONIC flag is set.
* A H1 connection with a H1 stream and a conn-stream attached.
  The H1C_F_CS_ATTACHED flag is set.
* A H1 connection with no H1 stream, waiting to be released. No flag is set.
These flags are mutually exclusive. When none is set, it means the
connection will be released ASAP; only the remaining outgoing data must be
sent first. For now, the second state (H1C_F_CS_EMBRYONIC) is transient.
For now this buffer is not used. But it will be used to parse the headers,
and possibly the first block of data, when no stream is attached to the H1
connection. The aim is to use it to create the stream, thanks to the recent
changes in the stream creation API.
Dedicated functions are now used to create frontend and backend H1
streams. h1c_frt_stream_new() is now used to create frontend H1 streams and
h1c_bck_stream_new() to create backend ones. Both rely on h1s_new() function
to allocate the stream itself. It is a bit easier to add specific processing
depending on whether we are on the frontend or the backend side.
Instead of using H1S flags to report an error on the request or the
response, regardless of whether it is a parsing or a formatting error, we
now use one flag to report parsing errors and another one to report
formatting ones. This simplifies the message parsing. It is also easier to
figure out what error happened when one of these flags is set. The side may
be deduced by checking the H1C_F_IS_BACK flag.
Instead of using 2 flags on the H1 stream (H1S_F_BUF_FLUSH and
H1S_F_SPLICED_DATA), we now only use one flag on the H1 connection
(H1C_F_WANT_SPLICE) to notify we want to use splicing or we are using
splicing. This flag blocks the calls to rcv_buf() connection callback.
It is a bit easier to set the H1 connection capability to receive data in
its input buffer instead of relying on the H1 stream.
H1C_F_WAIT_OPPOSITE must be set on the H1 connection to avoid reading more
data, because we must be in sync with the opposite side. This flag replaces
the H1C_F_IN_BUSY flag. Its name is a bit more explicit. It is automatically
set on the backend side when the mux is created. It is safe to do so because
at this stage, the request has not yet been sent to the server. This way, in
h1_recv_allowed(), a test on this flag is enough to block the reads instead
of testing the H1 stream state on the backend side.
It is now possible to set the buffer used as the request channel's buffer
when a stream is created. It may be useful if input data have already been
received, instead of waiting for the first call to the mux rcv_buf()
callback. This change is mandatory to support H1 connections with no stream
attached.
For now, the multiplexers don't pass any buffer. BUF_NULL is thus used to
call stream_create_from_cs().
This info is only provided by the mux-h1. But, thanks to previous patches,
we can get it from the session directly. There is no need to retrieve it
from the mux anymore.
Since the idle duration provided by the session is always up-to-date, there
is no more reason to rely on the multiplexer cs_info to set it to the
stream.
When a log message is emitted from the session, using sess_log() function,
there is no stream available. In this case, instead of deducing the idle
duration from the accept date, we use the one provided by the session. 0 is
used if it is undefined (i.e. set to -1).
This info is reset for the next transaction, if the connection is kept
alive. From the stream's point of view, it should be the same as a new
connection, except there is no handshake. Thus the handshake duration is set
to 0.
The idle duration between two streams is added to the session structure. It
is not necessarily pertinent on all protocols. In fact, it is only defined
for H1 connections. It is the duration between two H1 transactions. But the
.get_cs_info() callback function on the multiplexers only exists because
this duration is missing at the session level. So it is a simplification
opportunity at a really low cost.
To reduce the cost, a hole in the session structure is filled by moving
.srv_list field at the end of the structure.
IDLE frontend connections have no stream attached. The stream is only
created when new data are received, when the parsing of the next request
starts. Thus the keep-alive timeout, handled in the HTTP analysers, is not
considered while nothing is received. But this is precisely when this
timeout must be considered. Concretely the http-keep-alive timeout is
ignored while no data are received. Only the client timeout is used. It will
only be considered on incomplete requests, if the http-request timeout is
not set.
To fix the bug, the http-keep-alive timeout must be handled at the mux
level, for IDLE frontend connections only.
This patch should fix the issue #984. It must be backported as far as
2.2. On prior versions, the stream is created earlier. So, it is not a
problem, except if this behavior changes of course (it was an optimization
in 2.2, but I don't remember the commit).
A copy-paste bug between {tcp,udp}{4,6}_add_listener() resulted in
using a struct sockaddr_in to set the TCP/UDP port while it ought to
be a struct sockaddr_in6. Fortunately, the port has the same offset
(2) in both so it was harmless. A cleaner way to proceed would be
to have a set_port function exported by the address family layer.
This needs to be backported to 2.3.
It seems to me that lua_close() must be called on all states at deinit
time, not just the first two ones. This is likely a remnant of commit
59f11be43 ("MEDIUM: lua-thread: Add the lua-load-per-thread directive").
There should likely be some memory leak reports when using Lua without
this fix, though none were observed for now.
No backport is needed as this was merged into 2.4-dev.
Lua dedicated TCP, HTTP and SSL socket and proxies must be initialized
once. Right now, they are initialized from the Lua init state, but since
commit 59f11be43 ("MEDIUM: lua-thread: Add the lua-load-per-thread
directive") this function is called one time per lua context. This
caused some fields to be cleared and overwritten, and pre-allocated
objects to be lost. This is why the address sanitizer detected memory
leaks from the socket_ssl server initialization.
Let's move all the state-independent part of the function to the
hlua_init() function to avoid this.
No backport is needed, this is only 2.4-dev.
In case of successful unsafe method on a stored resource, the cached entry
must be invalidated (see RFC7234#4.4).
A "non-error response" is one with a 2xx (Successful) or 3xx (Redirection)
status code.
This implies that the primary hash must now be calculated on requests
that have an unsafe method (POST or PUT for instance) so that we can
disable the corresponding entries when we process the response.
The Cache-Control max-age and s-maxage directives should be followed by
a positive numerical value (see RFC 7234#5.2.1.1). According to the
specs, a sender "should not" generate a quoted-string value but we will
still accept this format.
When a response has an Age header (filled in by another cache on the
message's path) that is greater than its defined maximum age (extracted
either from cache-control directives or an expires header), it is
already stale and should not be cached.
The goal is to allow execution of one main lua state per thread.
This patch contains the main job. The lua init is done using these
steps:
- "lua-load-per-thread" loads the lua code in the first thread
- it creates the structs
- it stores loaded files
- the 1st step load is completed (execution of hlua_post_init)
  and now, we know the number of threads
- we initialize lua states for all remaining threads
- for each one, we load the lua file
- for each one, we execute post-init
Once all is loaded, we control consistency of functions references.
The rules are:
- a function reference cannot be in the shared lua state and in
a per-thread lua state at the same time.
- if a function reference is declared in a per-thread lua state, it
must be declared in all per-thread lua states
The goal is to allow execution of one main lua state per thread.
The array introduces storage of one reference per thread, because each
lua state can have different reference id for a same function. A function
returns the preferred state id according to configuration and current
thread id.
The goal is to allow execution of one main lua state per thread.
"state_from" is a pointer to the parent lua state. "state_id"
is the index of the parent state id in the reference lua states
array. "state_id" is better because the lock is a "== 0" test
which is quicker than a pointer comparison. Besides, the state_id
index could be used to index other things than the Lua state concerned;
I'm thinking of the function references.
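A minimal sketch of the idea, with illustrative names (not the actual
haproxy code):

    #include <pthread.h>

    /* only the shared state (index 0) needs locking; the per-thread
     * states (index >= 1) are never accessed concurrently.
     */
    static pthread_mutex_t hlua_global_lock = PTHREAD_MUTEX_INITIALIZER;

    static inline void hlua_state_lock(int state_id)
    {
        if (state_id == 0)
            pthread_mutex_lock(&hlua_global_lock);
    }

    static inline void hlua_state_unlock(int state_id)
    {
        if (state_id == 0)
            pthread_mutex_unlock(&hlua_global_lock);
    }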
The goal is to allow execution of one main lua state per thread.
This function will initialize the struct with values other than 0.
With this function helper, the initialization is centralized and
it prevents mistakes. This patch also keeps a reference to each
declared function in a list. It will be useful in next patches to
control consistency of declared references.
The goal is to allow execution of one main lua state per thread.
The array of states is initialized to the max number of threads + 1.
Index 0 is defined as the common state shared by all threads, which
must be locked. The other indexes are dedicated to one thread each.
The old gL now becomes hlua_states[0].
The goal is to allow execution of one main lua state per thread.
This patch opens the way to addition of a per-thread dedicated lua state.
By passing the hlua we can figure the original state that's been used
and decide to lock or not.
The goal is to allow execution of one main lua state per thread.
Stop using locks in the init part; we will use them only in parts where
the parent lua state is known, so we can take decisions about locking
according to the parent Lua state.
The goal is to allow execution of one main lua state per thread.
This commit introduces this variable in the core. Lua state initialized
by thread will have access to this variable, which reports the executing
thread. 0 indicates the shared thread. Programs which must be executed
only once can check for core.thread <= 1.
The goal is to allow execution of one main lua state per thread.
This function will be called for each initialized lua state, so
one per thread. The split transforms the lua state variable from
global to local.
The goal is to allow execution of one main lua state per thread.
This function will be called once per thread, using different Lua
states. This patch prepares the work.
The goal is to allow execution of one main lua state per thread.
The function hlua_ctx_init() now gets the original lua state from
its caller. This allows the initialisation of lua_thread (coroutines)
from any master lua state.
The parent lua state is stored in the hlua struct.
This patch is a temporary transition, it will be modified later.
The goal is to allow execution of one main lua state per thread.
This is a preparative work in order to init more than one stack
in the lua-thread objective.
The goal is to allow execution of one main lua state per thread.
Because this struct will be filled after the configuration parser, we
cannot copy the content. The current state of the HAProxy code doesn't
justify this change; it is an update preparing the next steps.
The goal is to no longer use "struct hlua" with global main lua_state.
The usage of the "struct hlua" is no longer required. This patch replaces
this struct by another one.
Now, the usage of the Lua runtime phase is separated from the Lua start phase.
The goal is to no longer use "struct hlua" with global main lua_state.
This patch returns a NULL value when some code tries to get the hlua struct
associated with a task through hlua_gethlua(). This function is useful
only during runtime because the struct hlua contains only runtime states.
Some Lua functions allowed to yield are called from the init environment.
I'm not sure this is a good practice. Maybe it would be wiser to
disallow calling this kind of function.
The goal is to no longer use "struct hlua" with global main lua_state.
If somewhere in the code hlua_ctx_renew() is called with a global Lua
context, we have a serious bug. A crash is better than running with
this bug, so this patch removes a useless check.
Previously, this check was used by the hlua_post_init() function.
The function hlua_post_init() used a call to the runtime hlua_ctx_resume()
function. This call no longer exists.
The goal is to no longer use "struct hlua" with global main lua_state.
The hlua_post_init() is executed during start phase, it does not require
yielding nor any advanced runtime error processing. Let's simplify this
by re-implementing the code using lower-level functions which directly
take a state and not an hlua anymore.
The goal is to no longer use "struct hlua" with global main lua_state
and directly take the state instead.
This patch removes the implicit dependency on this struct from
the hlua_prepend_path() function.
Let's switch memory accounting to atomics so that the allocator function
may safely be used from concurrent Lua states.
Given that this function is extremely hot on the call path, we try to
optimize it for the most common case, which is:
- no limit
- there's enough memory
The accounting is what is particularly expensive in threads since all
CPUs compete for a cache line, so when the limit is not used, we don't
want to use accounting. However we need to preserve it during the boot
phase until we may parse a "tune.lua.maxmem" value. For this, we turn
the unlimited "0" value to ~0 at the end of the boot phase to mark the
definite end of accounting. The function then detects this value and
directly jumps to realloc() in this case.
When the limit is enforced however, we use a CAS to check and reserve
our share of memory, and we roll back on failure. The CAS is used both
for increments and decrements so that a single operation is enough to
update the counters.
The function really has the semantics of a realloc() except that it
also passes the old size to help with accounting. No need to special
case the free or malloc, realloc does everything we need.
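A simplified sketch of such an allocator (illustrative only, with made-up
counter names; the real haproxy code differs):

    #include <stdatomic.h>
    #include <stddef.h>
    #include <stdlib.h>

    static _Atomic size_t allocated; /* bytes currently accounted for */
    static size_t mem_limit;         /* 0 during boot, ~0 once unlimited */

    static void *hlua_alloc_sketch(void *ud, void *ptr, size_t osize, size_t nsize)
    {
        size_t old, new;

        (void)ud;
        if (!ptr)
            osize = 0; /* Lua passes the object type here when ptr is NULL */

        if (mem_limit == ~(size_t)0)
            goto do_alloc; /* fast path: accounting disabled after boot */

        /* a single CAS loop handles both increments and decrements */
        old = atomic_load(&allocated);
        do {
            new = old + nsize - osize;
            if (nsize > osize && mem_limit && new > mem_limit)
                return NULL; /* would exceed the configured limit */
        } while (!atomic_compare_exchange_weak(&allocated, &old, new));

    do_alloc:
        if (!nsize) {
            free(ptr);     /* realloc(ptr, 0) semantics: plain free */
            return NULL;
        }
        ptr = realloc(ptr, nsize);
        if (!ptr && mem_limit != ~(size_t)0)
            atomic_fetch_sub(&allocated, nsize - osize); /* roll back */
        return ptr;
    }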
If the session is not established, the applet handler could leave
with the applet detached from the ring. At the next call, the attach
counter will be decreased again, causing unpredictable behavior.
This patch should be backported to branches >= 2.2.
With commit a1f12746b ("MINOR: traces: add a new level "error" below
the "user" level") a new trace level was inserted, resulting in
shifting all existing ones by one. But the levels reported in the
__trace() function were not updated accordingly, resulting in
TRACE_LEVEL_DEVELOPER not being properly reported anymore. This
patch fixes it by extending the number of levels to 6.
No backport is needed.
When many concurrent requests targeting the same resource were seen, the
cache could sometimes be filled by too many partial responses resulting
in the impossibility to cache a single one of them. This happened
because the actual tree insertion happened only after all the payload of
every response was seen. So until then, every response was added to the
cache because none of the streams knew that a similar request/response
was already being treated.
This patch consists in adding the cache_entry as soon as possible in the
tree (right after the first packet) so that the other responses do not
get cached as well (if they have the same primary key).
A "complete" flag is also added to the cache_entry so that we know if
all the payload is already stored in the entry or if it is still being
processed.
Turn the "Accept-Encoding" value to lower case before processing it.
Calculate the CRC on every token instead of a sorted concatenation of
them all (in order to avoid copying them) then XOR all the CRCs into a
single hash (while ignoring duplicates).
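The idea can be sketched as follows (a standalone sketch: FNV-1a stands in
for the CRC, and duplicate-token handling is only noted in a comment):

    #include <ctype.h>
    #include <stddef.h>
    #include <stdint.h>

    /* case-insensitive hash of one token */
    static uint32_t token_hash(const char *s, size_t len)
    {
        uint32_t h = 0x811c9dc5u;

        while (len--) {
            h ^= (uint8_t)tolower((unsigned char)*s++);
            h *= 0x01000193u;
        }
        return h;
    }

    /* XOR the per-token hashes so token order doesn't matter; the real
     * code must also skip duplicate tokens, otherwise two identical
     * tokens would cancel each other out.
     */
    static uint32_t accept_encoding_hash(const char *value)
    {
        uint32_t hash = 0;
        const char *tok = value;

        while (*tok) {
            const char *end = tok;

            while (*end && *end != ',')
                end++;
            hash ^= token_hash(tok, (size_t)(end - tok));
            tok = (*end == ',') ? end + 1 : end;
        }
        return hash;
    }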
Lua allows registering multiple sample-fetches, converters, actions, cli
commands and applet/services with the same name. This is absolutely useless
since only
the first registration will be used. This patch sends a warning if the case
is encountered.
This patch could be backported until 1.8, with the 3 associated patches:
- MINOR: actions: Export actions lookup functions
- MINOR: actions: add a function returning a service pointer from its name
- MINOR: cli: add a function to look up a CLI service description
This function will be useful to check if the keyword is already registered.
Also add a define for the max number of args.
This will be needed by a next patch to fix a bug and will have to be
backported.
This function simply calls action_lookup() on the private service_keywords,
to look up a service name. This will be used to detect double registration
of a same service from Lua.
This will be needed by a next patch to fix a bug and will have to be
backported.
These functions will be useful to check if a keyword is already registered.
This will be needed by a next patch to fix a bug, and will need to be
backported.
The luaL_openlibs() and hlua_prepend_path() operations are processed
without the safe context, so in case of failure HAProxy aborts or stops
without an error message.
This patch could be backported until 1.8
"lua-load" doesn't check if the expected parameter is present. It tries to
open() directly the argument at second position. So if the filename is
omitted, it tries to load an empty filename.
This patch could be backported until 1.8
The stats on haproxy.org reported ~12k GOAWAY for ~34k connections, with
only 2 protocol errors. It turns out that the GOAWAY frame counter added
in commit a8879238c ("MINOR: mux-h2: report detected error on stats")
matches a bit too many situations. First it counts those which are not
sent as well as failed retries, second it counts as errors the cases of
attempts to cleanly close, while it's titled "GOAWAY sent on detected
error". Let's address this by moving the counter up one line and excluding
the clean codes.
This can be backported to 2.3.
A number of traces could be added, and a few TRACE_PROTO were replaced
with TRACE_ERROR. The goal is to be able to enable error tracing only
to detect anomalies.
It looks like they're mostly correct as they don't seem to strike on
valid H2 traffic but are very verbose on h2spec.
Sometimes it would be nice to be able to only trace abnormal events such
as protocol errors. Let's add a new "error" level below the "user" level
for this. This will make it possible to add TRACE_ERROR() at various error
points and only see them.
Since commit a8879238c ("MINOR: mux-h2: report detected error on stats")
we now have some error stats on stream/connection level protocol errors,
but some were improperly marked as stream-level while they're
connection-level, and
2 or 3 relevant ones were missing and have now been added.
This could be backported to 2.3.
This patch adds a new logging variable '%HPO' for logging HTTP path only
(without query string) from relative or absolute URI.
For example:
log-format "hpo=%HPO hp=%HP hu=%HU hq=%HQ"
GET /r/1 HTTP/1.1
=>
hpo=/r/1 hp=/r/1 hu=/r/1 hq=
GET /r/2?q=2 HTTP/1.1
=>
hpo=/r/2 hp=/r/2 hu=/r/2?q=2 hq=?q=2
GET http://host/r/3 HTTP/1.1
=>
hpo=/r/3 hp=http://host/r/3 hu=http://host/r/3 hq=
GET http://host/r/4?q=4 HTTP/1.1
=>
hpo=/r/4 hp=http://host/r/4 hu=http://host/r/4?q=4 hq=?q=4
Since 2.3 the default local log format always adds the hostname field.
This behavior change was due to the log/sink rework, because according
to rfc3164 the hostname field is mandatory.
This patch re-introduces a legacy "local" format which is analogous
to rfc3164 but with the hostname stripped. This is the new
default if logs are generated by haproxy.
To stay compliant with previous configurations, the option
"log-send-hostname" acts as if the default format is switched
to rfc3164.
This patch addresses the github issue #963
This patch should be backported in branches >= 2.3.
In issue #958 Ashley Penney reported intermittent crashes on AWS's ARM
nodes which would not happen on x86 nodes. After investigation it turned
out that the Neoverse N1 CPU cores used in the Graviton2 CPU are much
more aggressive than the usual Cortex A53/A72/A55 or any x86 regarding
memory ordering.
The issue that was triggered there is that if a tasklet_wakeup() call
is made on a tasklet scheduled to run on a foreign thread and that
tasklet is just being dequeued to be processed, there can be a race at
two places:
- if MT_LIST_TRY_ADDQ() happens between MT_LIST_BEHEAD() and
LIST_SPLICE_END_DETACHED() if the tasklet is alone in the list,
because the emptiness test matches;
- if MT_LIST_TRY_ADDQ() happens during LIST_DEL_INIT() in
run_tasks_from_lists(), then depending on how LIST_DEL_INIT() ends
up being implemented, it may even corrupt the adjacent nodes while
they're being reused for the in-tree storage.
This issue was introduced in 2.2 when support for waking up remote
tasklets was added. Initially the attachment of a tasklet to a list
was enough to know its status and this used to be stable information.
Now it's not sufficient to rely on this anymore, thus we need to use
a different information.
This patch solves this by adding a new task flag, TASK_IN_LIST, which
is atomically set before attaching a tasklet to a list, and is only
removed after the tasklet is detached from a list. It is checked
by tasklet_wakeup_on() so that it may only be done while the tasklet
is out of any list, and is cleared during the state switch when calling
the tasklet. Note that the flag is not set for pure tasks as it's not
needed.
However this introduces a new special case: the function
tasklet_remove_from_tasklet_list() needs to keep both states in sync
and cannot check both the state and the attachment to a list at the
same time. This function is already limited to being used by the thread
owning the tasklet, so in this case the test remains reliable. However,
just like its predecessors, this function is wrong by design and it
should probably be replaced with a stricter one, a lazy one, or be
totally removed (it's only used in checks to avoid calling a possibly
scheduled event, and when freeing a tasklet). Regardless, for now the
function exists so the flag is removed only if the deletion could be
done, which covers all cases we're interested in regarding the insertion.
This removal is safe against a concurrent tasklet_wakeup_on() since
MT_LIST_DEL() guarantees the atomic test, and will ultimately clear
the flag only if the task could be deleted, so the flag will always
reflect the last state.
This should carefully be backported as far as 2.2 after some
observation period. This patch depends on previous patch
"MINOR: task: remove __tasklet_remove_from_tasklet_list()".
This function is only used at a single place directly within the
scheduler in run_tasks_from_lists() and it really ought not be called
by anything else, regardless of what its comment says. Let's delete
it, move the two lines directly into the call place, and take this
opportunity to factor the atomic decrement on tasks_run_queue. A comment
was added on the remaining one tasklet_remove_from_tasklet_list() to
mention the risks in using it.
In process_runnable_tasks(), we walk the run queue and pick tasks to
insert them into the local list. And for each of these operations we
perform a few increments, some of which are atomic, and they're even
performed under the runqueue's lock. This is useless inside the loop,
better do them at the end, since we don't use these values inside the
loop and they're not used anywhere else either during this time. The
only one is task_list_size which is accessed in parallel by other
threads performing remote tasklet wakeups, but it's already
approximate and is used to decide to get out of the loop when the
limit is reached. So now we compute it first as an initial budget
instead.
This function is only called at a single place and adds more confusion
than it removes. It also makes one think it could be used outside of
the scheduler while it must absolutely not. Let's just move its two
lines to the call place, making the code more readable there. In
addition this clearly shows that the preliminary LIST_INIT() is
useless since the entry is immediately overwritten.
Commit a5a447984 ("MINOR: debug: add "debug dev sched" to stress the
scheduler.") doesn't scale with threads because ha_random64() takes care
of being totally thread-safe for use with UUIDs. We don't need this for
the stress-testing functions, let's just implement a xorshift PRNG
instead. On 8 threads the performance jumped from 230k ctx/s with 96%
spent in ha_random64() to 14M ctx/s.
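For reference, a classic xorshift64 step looks like this (a sketch, not
necessarily the exact code used):

    /* per-thread state; must be initialized to a non-zero value */
    static inline unsigned long long xorshift64(unsigned long long *state)
    {
        unsigned long long x = *state;

        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        return *state = x;
    }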
This command supports starting a bunch of tasks or tasklets, either on the
current thread (mask=0), all (default), or any set, either single-threaded
or multi-threaded, and possibly auto-scheduled.
These tasks/tasklets will randomly pick another one to wake it up. The
tasks only do it 50% of the time while tasklets always wake two tasks up,
in order to achieve roughly 50% load (since the target might already be
woken up).
res.body may be called from a health-check. It is probably never used, but
it is possible. In such a case, there is no channel. Thus we must not use it
unconditionally to set the SMP_F_MAY_CHANGE flag on the smp.
Now the condition tests the channel first. In addition, the flag is not set
if the payload is fully received.
This patch must be backported as far as 2.2.
L7OKC may now be used as an error status for an HTTP/TCP expect rule. Thus
it is for instance possible to write:
option httpchk GET /isalive
http-check expect status 200,404
http-check expect status 200 error-status L7OKC
It is more or less the same as the disable-on-404 option, except that a
server which is DOWN and replies again with a 404 will be set to the NOLB
state, while it would stay in the DOWN state with the disable-on-404 option.
Regarding the health counter, a check finished with the CONDPASS result is
now handled the same as with the PASSED result: the health counter is always
incremented. Before, this was only done if the health counter was not 0.
There is no change for the disable-on-404 option because it is only
evaluated for running or stopping servers, so with a health check counter
greater than 0. But it will make it possible to handle the (STOPPED ->
STOPPING) transition for servers.
The parsing of the check options based on tcp-check rules (redis, spop,
smtp, http...) is moved away from check.c. Now, these functions are placed
in tcpcheck.c. These functions are only related to the tcpcheck ruleset
configured on a proxy and not to the health-check attached to a server.
This option is now deprecated. It is recent, but it is now marked as
deprecated as far back as 2.2. Thus, there is now a warning in 2.4 if this
option is still used. It will be removed in 2.5.
Because the 2.3 is quite new, this patch may be backported to 2.3.
This option is now ignored because I/O check buffers are now allocated using the
buffer pool. Thus, it is marked as deprecated in the documentation and ignored
during the configuration parsing. The field is also removed from the global
structure.
Because this option has been ignored since a recent fix, backported as far
as 2.2, this patch should be backported too, especially because it updates
the documentation.
The special handling of in-progress connect rules at the beginning of the
tcpcheck_main() function can be removed. Instead, at the beginning of the
tcpcheck_eval_connect() function, we test if there is already an existing
connection. In this case, it means we are waiting for a connection
establishment. In addition, before evaluating a new connect rule, we take
care to release any previous connection.
Historically, the input and output buffers of a check are allocated by hand
during startup, with a specific size (not necessarily the same as other
buffers). But since the recent refactoring of the checks to rely
exclusively on the tcp-checks and to use the underlying mux layer, this part
is totally buggy. Indeed, because these buffers are now passed to a mux,
they may be swapped if a zero-copy is possible. In fact, for now it is
only possible in h2_rcv_buf(). Thus the bug concretely only exists if an h2
health-check is performed. But it is a latent bug for other muxes.
Another problem is the size of these buffers. Because it may differ from
the other buffer sizes, it might be a source of bugs.
Finally, for configurations with hundreds of thousands of servers, having 2
buffers per check always allocated may be an issue.
To fix the bug, we now allocate these buffers when required using the buffer
pool. Thus not-running checks don't waste memory and muxes may swap them if
possible. The only drawback is that the check buffers now always have the
same size as the buffers used by the streams. This indirectly deprecates the
"tune.chksize" global option.
In addition, the http-check regtest has been updated to perform some h2
health-checks.
Many thanks to @VigneshSP94 for their help on this bug.
This patch should solve the issue #936. It relies on the commit "MINOR:
tcpcheck: Don't handle anymore in-progress send rules in tcpcheck_main".
Both must be backport as far as 2.2.
The special handling of in-progress send rules at the beginning of the
tcpcheck_main() function can be removed. Instead, at the beginning of the
tcpcheck_eval_send() function, we test if there is some data in the output
buffer. In this case, it means we are evaluating an unfinished send rule and
we can jump to the sending part, skipping the formatting part.
This patch is mandatory for a major fix on the checks and must be backported
as far as 2.2.
When a new kind of check is found during the parsing of a proxy section (via
an option directive), we must reset tcpcheck flags for this proxy. It is
mandatory to not inherit some flags from a previously declared check (for
instance in the default section).
This patch must be backported as far as 2.2.
Building with gcc-9.3.0 without threads may result in this warning:
In file included from include/haproxy/api-t.h:36,
from include/haproxy/api.h:33,
from src/fd.c:90:
src/fd.c: In function 'updt_fd_polling':
include/haproxy/fd.h:507:11: warning: array subscript 63 is above array bounds of 'int[1]' [-Warray-bounds]
507 | DISGUISE(write(poller_wr_pipe[tid], &c, 1));
include/haproxy/compiler.h:92:41: note: in definition of macro 'DISGUISE'
92 | #define DISGUISE(v) ({ typeof(v) __v = (v); ALREADY_CHECKED(__v); __v; })
| ^
src/fd.c:113:5: note: while referencing 'poller_wr_pipe'
113 | int poller_wr_pipe[MAX_THREADS]; // Pipe to wake the threads
| ^~~~~~~~~~~~~~
gcc is wrong but this time it cannot be blamed because it doesn't know
that the FD's thread_mask always has at least one bit set. Let's add
the test for all_threads_mask there. It will also remove that test and
drop the else block.
Another bug in the peers message parser was uncovered by last commit
1dfd4f106 ("BUG/MEDIUM: peers: fix decoding of multi-byte length in
stick-table messages"): the function return on incomplete message does
not check if the channel has a pending close before deciding to return
0. It did not hurt previously because the loop calling co_getblk() once
per character would have depleted the buffer and hit the end, causing
<0 to be returned and matching the condition. But now that we process
at once what is available this cannot be relied on anymore and it's
now clearly visible that the final check is missing.
What happens when this strikes is that if a peer connection breaks in
the middle of a message, the function will return 0 (missing data) but
the caller doesn't check for the closed buffer, subscribes to reads,
and the applet handler is immediately called again since some data are
still available. This is detected by the loop prevention and the process
dies complaining that an appctx is spinning.
This patch simply adds the check for closed channel. It must be
backported to the same versions as the fix above.
Since commit 3d08236cb3 HAProxy can be trivially
crashed remotely by sending an `accept-encoding` HTTP request header that
contains 16 commas.
This is because the `values` array in `accept_encoding_normalizer` accepts only
16 entries and it is not verified whether the end is reached during looping.
Fix this issue by checking the length. This patch also simplifies the ist
processing in the loop, because it manually calculated offsets and lengths,
when the ist API exposes perfectly safe functions to advance and truncate ists.
I wonder whether the accept_encoding_normalizer function is able to re-use some
existing function for parsing headers that may contain lists of values. I'll
leave this evaluation up to someone else, only patching the obvious crash.
This commit is 2.4-dev specific and was merged just a few hours ago. No
backport needed.
The cache section's process-vary option takes a 0 or 1 value to disable
or enable the vary processing.
When disabled, a response containing such a header will never be cached.
When enabled, we will calculate a preliminary hash for a subset of request
headers on all the incoming requests (which might come with a cpu cost) which
will be used to build a secondary key for a given request (see RFC 7234#4.1).
The default value is 0 (disabled).
Calculate a preliminary secondary key for every request we see so that
we can have a real secondary key if the response is cacheable and
contains a manageable Vary header.
The cache's ebtree is now allowed to have multiple entries with the same
primary key. Two of those entries will be distinguished thanks to
secondary keys stored in the cache_entry (based on hashes of a subset of
their headers).
When looking for an entry in the cache (cache_use), we still use the
primary key (built the same way as before), but in case of match, we
also need to check if the entry has a vary signature. If it has one, we
need to perform an extra check based on the newly built secondary key.
We will only be able to forge a response out of the cache if both the
primary and secondary keys match with one of our entries. Otherwise the
request will be forwarded to the server.
The Vary functionality is based on a secondary key that needs to be
calculated for every request to which a server answers with a Vary
header. The Vary header, which can only be found in server responses,
determines which headers of the request need to be taken into account in
the secondary key. Since we do not want to have to store all the headers
of the request until we have the response, we will pre-calculate as many
sub-hashes as there are headers that we want to manage in a Vary
context. We will only focus on a subset of headers which are likely to
be mentioned in a Vary response (accept-encoding and referer for now).
Every managed header will have its own normalization function which is
in charge of transforming the header value into a core representation,
more robust to insignificant changes that could exist between multiple
clients. For instance, two accept-encoding values mentioning the same
encodings but in different orders should give the same hash.
This patch adds a function that parses a Vary header value and checks if
all the values belong to our supported subset. It also adds the
normalization functions for our two headers, as well as utility
functions that can prebuild a secondary key for a given request and
transform it into an actual secondary key after the vary signature is
determined from the response.
When at least one data filter is registered on a channel, the offsets of all
filters must be kept up to date. For data filters but also for others. It is
safer to do it in that way. Indirectly, this patch fixes 2 hidden bugs
revealed by the commit 22fca1f2c ("BUG/MEDIUM: filters: Forward all filtered
data at the end of http filtering").
The first one, the worst of both, happens at the end of http filtering when
at least one data filtered is registered on the channel. We call the
http_end() callback function on the filters, when defined, to finish the
http filtering. But it is performed for all filters. Before the commit
22fca1f2c, the only risk was to call the http_end() callback function
unexpectedly on a filter. Now, we may have an overflow on the offset
variable, used at the end to forward all filtered data. Of course, from the
moment we forward an arbitrary huge amount of data, all kinds of bad things
may happen. So offset computation is performed for all filters and
http_end() callback function is called only for data filters.
The other one happens when a data filter alters the data of a channel: it
must update the offsets of all previous filters. But the offsets of non-data
filters must also be up to date, otherwise, here too, we may have an integer
overflow.
Another way to fix these bugs is to always ignore non-data filters from the
offsets computation. But this patch is safer and probably easier to
maintain.
This patch must be backported in all versions where the above commit is. So
as far as 2.0.
Restore the init of the log-format list in parse_http_del_header, which was
accidentally deleted by commit ebdd4c55da
(implementation of different header matching methods for
http-request/response del-header).
This is related to GitHub issue #909
"RAND_keep_random_devices_open" is OpenSSL specific, does not present
in other OpenSSL variants like LibreSSL or BoringSSL. BoringSSL recently
"updated" its internal openssl version to 1.1.1, we temporarily set it
back to 1.1.0, as we are going to remove that hack, let us add proper
guarding.
Level-7 retries are only possible with a restricted number of HTTP
return codes. While it is usually not safe to retry on 401 and 403, I
came up with an authentication backend which was not synchronizing
authentication of users. While not perfect, being allowed to also retry
on those return codes is really helpful and acts as a hotfix until we
can fix the backend.
Signed-off-by: Julien Pivotto <roidelapluie@inuits.eu>
This patch adds -m flag which allows to specify header name
matching method when deleting headers from http request/response.
Currently beg, end, sub, str and reg are supported.
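For example, to delete all request headers whose name starts with a given
prefix (an illustrative configuration snippet):

    http-request del-header x-private- -m beg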
This is related to GitHub issue #909
The function __http_find_header is used to search headers by name using a
specified matching method. Matching by substring returned unexpected results
due to a wrong substring length supplied to the strnistr function.
Also fixed the boolean condition by inverting it, as we're interested in
headers that contain the substring.
This patch should be backported as far as 2.2
Baptiste reported a new crash affecting 2.3 which can be triggered
when using H2 on the backend, with http-reuse always and with tens
of clients doing close only. There are a few combined cases which cause
this to happen, but each time the issue is the same, an already freed
session is dereferenced in session_unown_conn().
Two cases were identified to cause this:
- a connection referencing a session as its owner, which is detached
from the session's list and is destroyed after this session ends.
The test on conn->owner before calling session_unown_conn() is not
sufficient as the pointer is not null but is not valid anymore.
- a connection that never goes idle and that gets killed from the
mux, where session_free() is called first, then conn_free() calls
session_unown_conn() which scans the just freed session for older
connections. This one is only triggered with DEBUG_UAF
The reason for this session to be present here is that it's needed during
the connection setup, to be passed to conn_install_mux_be() to mux->init()
as the owning session, but it's never deleted afterwards. Furthermore, even
conn_session_free() doesn't delete this pointer after freeing the session
that lies there. Both do definitely result in a use-after-free that's more
easily triggered under DEBUG_UAF.
This patch makes sure that the owner is always deleted after detaching
or killing the session. However it is currently not possible to clear
the owner right after a synchronous init because the proxy protocol
apparently needs it (a reg test checks this), and if we leave it past
the connection setup with the session not attached anywhere, it's hard
to catch the right moment to detach it. This means that the session may
remain in conn->owner as long as the connection has never been added to
nor removed from the session's idle list. Given that this patch needs to
remain simple enough to be backported, instead it adds a workaround in
session_unown_conn() to detect that the element is already not attached
anywhere.
This fix absolutely requires previous patch "CLEANUP: connection: do not
use conn->owner when the session is known" otherwise the situation will
be even worse, as some places used to rely on conn->owner instead of the
session.
The fix could theoretically be backported as far as 1.8. However, the code
in this area has significantly changed along versions and there are more
risks of breaking working stuff than fixing real issues there. The issue
was really woken up in two steps during 2.3-dev when slightly reworking
the idle conns with commit 08016ab82 ("MEDIUM: connection: Add private
connections synchronously in session server list") and when adding
support for storing used H2 connections in the session and adding the
necessary call to session_unown_conn() in the muxes. But the same test
managed to crash 2.2 when built in DEBUG_UAF and patched like this,
proving that we used to already leave dangling pointers behind us:
| diff --git a/include/haproxy/connection.h b/include/haproxy/connection.h
| index f8f235c1a..dd30b5f80 100644
| --- a/include/haproxy/connection.h
| +++ b/include/haproxy/connection.h
| @@ -458,6 +458,10 @@ static inline void conn_free(struct connection *conn)
| sess->idle_conns--;
| session_unown_conn(sess, conn);
| }
| + else {
| + struct session *sess = conn->owner;
| + BUG_ON(sess && sess->origin != &conn->obj_type);
| + }
|
| sockaddr_free(&conn->src);
| sockaddr_free(&conn->dst);
It's uncertain whether an existing code path there can lead to dereferencing
conn->owner when it's bad, though certain suspicious memory corruption bugs
make one think it's a likely candidate. The patch should not be hard to
adapt there.
Backports to 2.1 and older are left to the appreciation of the person
doing the backport.
A reproducer consists in this:
global
nbthread 1
listen l
bind :9000
mode http
http-reuse always
server s 127.0.0.1:8999 proto h2
frontend f
bind :8999 proto h2
mode http
http-request return status 200
Then this will make it crash within 2-3 seconds:
$ h1load -e -r 1 -c 10 http://0:9000/
If it does not, it might be that DEBUG_UAF was not used (it's harder then)
and it might be useful to restart.
At a few places we used to rely on conn->owner to retrieve the session
while the session is already known. This is not correct because at some
of these points the reason the connection's owner was still the session
(instead of NULL) is a mistake. At one place a comparison is even made
between the session and conn->owner assuming it's valid without checking
if it's NULL. Let's clean this up to use the session all the time.
Note that this will be needed for a forthcoming fix and will have to be
backported.
HAVE_SSL_CTX_SET_CIPHERSUITES is a newly defined macro set in
openssl-compat.h, which helps to identify ssl libs (currently OpenSSL-1.1.1
only) that support TLS13 ciphersuites manipulation on a TLS13 context.
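Such a guard typically looks like this (an illustrative sketch;
SSL_CTX_set_ciphersuites() appeared in OpenSSL 1.1.1):

    /* openssl-compat.h style guard (sketch) */
    #if (OPENSSL_VERSION_NUMBER >= 0x10101000L) && !defined(LIBRESSL_VERSION_NUMBER)
    #define HAVE_SSL_CTX_SET_CIPHERSUITES
    #endif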
When a file from a crt-list was not found, it was ignored silently,
letting HAProxy start without it.
This bug was introduced by 47da821 ("MEDIUM: ssl: emulates the
multi-cert bundles in the crtlist").
This commit adds a found variable which is checked once we have tried every
bundle combination, so we can exit with an error if none were found.
Must be backported in 2.3.
When a non-existing file was specified in the configuration, haproxy
did not exit with an error, which is not normal.
This bug was introduced by dfa93be ("MEDIUM: ssl: emulate multi-cert
bundles loading in standard loading") which does nothing if the stat
fails.
This patch introduces a "found" variable which is checked at the end of
the function so we exit with an error if no file was found.
Must be backported to 2.3.
In issue #970 it was reported that the bundle loading does not work
anymore with crt-list.
This bug was introduced by 47da821 ("MEDIUM: ssl: emulates the
multi-cert bundles in the crtlist") which incorrectly uses "path"
instead of "crt_path" in the name resolution.
Must be backported to 2.3.
It is not possible for a response coming from a server, but an errorfile may be
empty. In this case, the http-after-response ruleset must not be evaluated
because it is totally unexpected to manipulate headers on an empty HTX message.
This patch must be backported everywhere the http-after-response rules are
supported, i.e as far as 2.2.
In bug #959 it was reported that haproxy segfaults on startup when trying
to load a certificate which uses the X509v3 AKID extension but without the
keyid field.
This field is not mandatory and could be replaced by the serial or the
DirName.
For example:
X509v3 extensions:
X509v3 Basic Constraints:
CA:FALSE
X509v3 Subject Key Identifier:
42:7D:5F:6C:3E:0D:B7:2C:FD:6A:8A:32:C6:C6:B9:90:05:D1:B2:9B
X509v3 Authority Key Identifier:
DirName:/O=HAProxy Technologies/CN=HAProxy Test Intermediate CA
serial:F2:AB:C1:41:9F:AB:45:8E:86:23:AD:C5:54:ED:DF:FA
This bug was introduced by 70df7b ("MINOR: ssl: add "issuers-chain-path" directive").
This patch must be backported as far as 2.2.
In the context of a progressive backend migration, we want to be able to
activate SSL on outgoing connections to the server at runtime without
reloading.
This patch adds a `set server ssl` command; in order to allow that:
- add `srv_use_ssl` to `show servers state` command for compatibility,
also update associated parsing
- when using default-server ssl setting, and `no-ssl` on server line,
init SSL ctx without activating it
- when triggering ssl API, de/activate SSL connections as requested
- clean ongoing connections as it is done for addr/port changes, without
checking prior server state
example config:
backend be_foo
default-server ssl
server srv0 127.0.0.1:6011 weight 1 no-ssl
show servers state:
5 be_foo 1 srv0 127.0.0.1 2 0 1 1 15 1 0 4 0 0 0 0 - 6011 - -1
where srv0 can switch to ssl later during the runtime:
set server be_foo/srv0 ssl on
5 be_foo 1 srv0 127.0.0.1 2 0 1 1 15 1 0 4 0 0 0 0 - 6011 - 1
Also update existing tests and create a new one.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
A common init for ssl_ctx will later be usable in other functions in
order to support hot enabling of SSL during runtime.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
CONNECT requests are bodyless messages but with no EOM blocks. Thus, conditions
to stop waiting for the message payload are not suited to this kind of
message. Indeed, the message finishes on an EOH block. But the tunnel mode at
the stream level is only set in HTTP_XFER_BODY analyser. So, the stream is
blocked, waiting for a body that does not exist till a timeout expires.
To fix this bug, we just stop waiting for a body for CONNECT requests. Another
solution is to rely on HTX_SL_F_BODYLESS/HTTP_MSGF_BODYLESS flags. But this one
is less intrusive.
This patch must be backported as far as 2.0. For the 2.0, only the HTX part
must be fixed.
When HTTP filtering ends, if there are some filtered data not forwarded yet, we
forward them in flt_http_end(). Most of the time, this doesn't happen, except
when a tunnel is established using a CONNECT. In this case, there is no EOM on
the request and there is no body. Thus the headers are never forwarded, blocking
the stream.
This patch must be backported as far as 2.0. Prior versions don't suffer from this
bug because there is no HTX support. On the 2.0, the change is only applicable
on HTX streams. A special test must be performed to make sure.
Return ERR_NONE instead of 0 on success for all config callbacks that should
return ERR_* codes. There is no functional change because ERR_NONE is a macro
equal to 0. But this makes the return value more explicit.
Post-check function callbacks must return ERR_* flags. Thus, init_h2() is fixed
to return ERR_NONE on success or (ERR_ALERT|ERR_FATAL) on error.
This patch may be backported as far as 2.2.
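For illustration, a post-check callback following this convention could look
like the sketch below (ERR_NONE, ERR_ALERT and ERR_FATAL are HAProxy's error
flags; the allocated resource is a hypothetical placeholder):

  #include <stdlib.h>

  static void *h2_scratch;   /* hypothetical resource */

  /* post-check callback: must return ERR_* flags, never a bare 0/1 */
  static int init_h2(void)
  {
          h2_scratch = malloc(4096);              /* placeholder allocation */
          if (!h2_scratch)
                  return ERR_ALERT | ERR_FATAL;   /* fatal startup error */
          return ERR_NONE;                        /* explicit success (== 0) */
  }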
Functions registered to release memory per-thread have no return value. But the
registering function and the function pointer in per_thread_free_fct structure
specify it should return an integer. This patch fixes it.
This patch may be backported as far as 2.0.
When tcp-check or http-check rules are used, if the corresponding check option
(option tcp-check and option httpchk) is declared after the ruleset, a warning
is emitted about an unused check ruleset while there is no problem in reality.
This patch must be backported as far as 2.2.
In sync mode, if an applet receives an ack while the processing delay has already
expired, there is no frame waiting for this ack. But there is no reason to
close the connection in this case. The ack may be ignored and the connection may
be reused to process another frame. The only reason to trigger an error and
close the connection is when the wrong ack is received while there is still a
frame waiting for its ack. In sync mode, this should never happen.
This patch may be backported in all versions supporting the SPOE.
When a SPOE applet is used to send a frame, a reference on this applet is saved
in the spoe context of the offloaded stream. But, if the applet is released
before receiving the corresponding ack, we must be sure to remove this
reference. This was performed for fragmented frames only. But it must also be
performed for spoe contexts in the applet waiting_queue and in the thread
waiting_queue (used in async mode).
This bug leads to a memory corruption when an offloaded stream tries to update
the state of a released applet because it still has a reference on it. There are
many ways to trigger this bug. The easiest is probably during reloads. On the
old process, all applets are woken up to be released ASAP.
Many thanks to Maciej Zdeb for reporting the bug and working on it for 2
months. Without his help, it would have been much more difficult to fix the
bug. It is always a huge pleasure to see how some users are enthusiastic and
helpful. Thanks again Maciej !
This patch must be backported to all versions where the spoe is supported (>=
1.7).
First of all, this patch is tagged as a bug. But in fact, it only fixes a bug in
the 2.2. On the 2.3 and above, it only adds the ability to display warnings when
an http-error directive is parsed from a proxy section and when an errorfile
directive is parsed from an http-errors section.
But on the 2.2, it makes sure to display the warning emitted on a content-length
mismatch when an errorfile is parsed. The following is only applicable to the
2.2.
commit "BUG/MINOR: http-htx: Just warn if payload of an errorfile doesn't match
the C-L" (which is only present in 2.2, 2.1 and 2.0 trees, i.e see commit
7bf3d81d3cf4b9f4587 in 2.2 tree), is changing the behavior of `http_str_to_htx`
function. It may now emit warnings, and it is the caller's responsibility to
display them.
But the warning is missing when an 'http-error' directive is parsed from
a proxy section. It is also missing when an 'errorfile' directive is
parsed from an http-errors section.
This bug only exists on the 2.2. On earlier versions, these directives
are not supported and on later ones, an error is triggered instead of a
warning.
Thanks to William Dauchy who spotted the bug.
This patch must be backported as far as 2.2.
Report an error when using an explicit proto for a connect rule with a
non-compatible mode with regard to the selected check type (tcp-check
vs http-check).
If the check mux has been explicitly defined but is incompatible with
the selected check type (tcp-check vs http-check), report a warning and
prevent haproxy startup.
Only reuse the mux from server if the check is using the same mode.
For example, this prevents a tcp-check on an h2 server from selecting the h2
multiplexer instead of passthrough.
This bug was introduced by the following commit:
BUG/MEDIUM: checks: Use the mux protocol specified on the server line
It must be backported up to 2.2.
Fixes github issue #945.
req.cook, req.cook_val, req.cook_cnt and their response counterparts may be
called without a cookie name. In this case, empty parentheses may be used, or no
parentheses at all. In both cases, the result must be the same. But only the first one
works. The second one always returns a failure. This patch fixes this bug.
Note that on old versions (< 2.2), both cases fail.
This patch must be backported in all stable versions.
HTTP sample fetches dealing with the cookies (req/res.cook,
req/res.cook_val and req/res.cook_cnt) must be prepared to be called
without a cookie name. For the first two, the first cookie value is
returned, regardless of its name. For the last one, all cookies are counted.
To do so, http_extract_cookie_value() may now be called with no cookie
name (cookie_name_l set to 0). In this case, the matching on the cookie
name is ignored and the first value found is returned.
Note this patch also fixes matching on cookie values in ACLs.
This should be backported in all stable versions.
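To illustrate the intended behaviour, here is a small standalone sketch, not
HAProxy's actual http_extract_cookie_value() (whose signature differs): with a
zero name length, the name comparison is skipped and the first value wins.

  #include <stddef.h>
  #include <string.h>

  /* Return the first value in "k1=v1; k2=v2" whose name matches, or the
   * very first value when name_len is 0. Illustration only. */
  static const char *cookie_value(const char *hdr, const char *name,
                                  size_t name_len, size_t *val_len)
  {
          const char *p = hdr;

          while (*p) {
                  const char *eq = strchr(p, '=');
                  const char *end;

                  if (!eq)
                          break;
                  end = strchr(eq, ';');
                  if (!end)
                          end = eq + strlen(eq);
                  if (!name_len || ((size_t)(eq - p) == name_len &&
                                    memcmp(p, name, name_len) == 0)) {
                          *val_len = (size_t)(end - (eq + 1));
                          return eq + 1;
                  }
                  p = end;
                  while (*p == ';' || *p == ' ')
                          p++;          /* skip the "; " separator */
          }
          return NULL;
  }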
There is a bug in peer_recv_msg() due to an incorrect cast when trying
to decode the varint length of a stick-table message, causing lengths
comprised between 128 and 255 to consume one extra byte, ending in
protocol errors. The root cause of this is that peer_recv_msg() tries
hard to reimplement all the parsing and control that is already done in
intdecode() just to measure the length before calling it. And it got it
wrong.
Let's just get rid of this unneeded code duplication and solely rely on
intdecode() instead. The bug was introduced in 2.0 as part of a cleanup
pass on this code with commit 95203f218 ("MINOR: peers: Move high level
receive code to reduce the size of I/O handler."), so this patch must
be backported to 2.0.
Thanks to Yves Lafon for reporting the problem.
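For reference, the varint format used here stores values below 240 in a single
byte; larger values use the first byte plus 7-bit continuation chunks. A
standalone decoder in the spirit of intdecode() (hedged sketch, error handling
simplified) looks like this:

  #include <stdint.h>

  /* Decode a varint from [*buf, end). Advances *buf on success, or sets it
   * to NULL on truncated input. Each continuation byte adds its value
   * shifted by 4, 11, 18, ... bits; bit 7 marks a continuation. */
  static uint64_t varint_decode(const unsigned char **buf,
                                const unsigned char *end)
  {
          const unsigned char *msg = *buf;
          uint64_t i;
          int shift;

          if (msg >= end)
                  goto too_short;

          i = *msg++;
          if (i >= 240) {
                  shift = 4;
                  do {
                          if (msg >= end)
                                  goto too_short;
                          i += (uint64_t)*msg << shift;
                          shift += 7;
                  } while (*msg++ >= 128);
          }
          *buf = msg;
          return i;

   too_short:
          *buf = NULL;
          return 0;
  }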
The TX part of a cache for a dictionary is made of a reserved array of ebtree
nodes which are pointers to dictionary entries. So when we flush the TX part of
such a cache, we must not only remove these nodes from their ebtree, we must
also reset their values. Furthermore, the LRU key and the last lookup
result must also be reset.
If we could not decode the ID of a dictionary entry from a peer update message,
we must inform the remote peer about such an error as this is done for
any other decoding error.
Define per-thread counters allocated with the greatest size of any
stat module counters. This variable is named trash_counters.
When using a proxy without allocated counters, return the trash counters
from EXTRA_COUNTERS_GET instead of a dangling pointer to prevent
segfault.
This is useful for all the proxies used internally and not
belonging to the global proxy list. As these objects do not appear on
the stats report, it does not matter to use the dummy counters.
For this fix to be functional, the extra counters are explicitly
initialized to NULL on proxy/server/listener init functions.
Most notably, the crash has already been detected with the following
vtc:
- reg-tests/lua/txn_get_priv.vtc
- reg-tests/peers/tls_basic_sync.vtc
- reg-tests/peers/tls_basic_sync_wo_stkt_backend.vtc
There are probably other parts that may be impacted (SPOE for example).
This bug was introduced in the current release and does not need to be
backported. The faulty commits are
"MINOR: ssl: count client hello for stats" and
"MINOR: ssl: add counters for ssl sessions".
Register a new function on POST DEINIT to free stats fields/lines for
each domain.
This patch does not fix a critical bug but may be backported to 2.3.
Do not cache responses that do not have an explicit expiration time
(s-maxage or max-age Cache-Control directives or Expires header) or a
validator (ETag or Last-Modified headers) anymore, as suggested in
RFC 7234#3.
The TX_FLAG_IGNORE flag is used instead of the TX_FLAG_CACHEABLE so as
not to change the behavior of the checkcache option.
This size is used by some pattern matching to determine if there
is sufficient room in the buffer to add a final \0 if necessary.
If the size is not set, the conditions use an uninitialized value.
Note: it seems this bug can't cause a crash.
Should be backported as far as 2.2 (at least).
The functions add a final 0 to the string if it is not already set,
but don't check the CONST flag. This patch duplicates the string
if the final zero is not set and the string is CONST.
Should be backported as far as 2.2 (at least).
In issue #940, it was reported that the crt-list does not work correctly
anymore. Indeed, when inserting a crt-list line which uses a certificate
previously seen in the crt-list, it won't be inserted in the SNI
list and will be silently ignored.
This bug was introduced by commit 47da821 "MEDIUM: ssl: emulates the
multi-cert bundles in the crtlist".
This patch also includes a reg-test which tests this issue.
This bugfix must be backported in 2.3.
Commit a66adf41e ("MINOR: http-htx: Add understandable errors for the
errorfiles parsing") added a warning when loading malformed error files,
but this warning may trigger another build warning due to the %lu format
used. Let's simply cast it for output since it's just used for end user
output.
This must be backported to 2.0 like the commit above.
Since commit d0447a7c3 ("MINOR: ssl: add counters for ssl sessions"),
gcc 9+ complains about this:
CC src/ssl_sock.o
src/ssl_sock.c: In function 'ssl_sock_io_cb':
src/ssl_sock.c:5416:3: warning: 'counters_px' may be used uninitialized in this function [-Wmaybe-uninitialized]
5416 | ++counters_px->reused_sess;
| ^~~~~~~~~~~~~~~~~~~~~~~~~~
src/ssl_sock.c:5133:23: note: 'counters_px' was declared here
5133 | struct ssl_counters *counters, *counters_px;
| ^~~~~~~~~~~
Either a listener or a server is expected there, so the counters are
always initialized, but the compiler cannot know this. Let's preset
them and test before updating the counter, we're not in a hot path
here.
No backport is needed.
This function used to grab the idle lock when scanning the threads for
idle connections, but it doesn't need it since the lock only protects
the tree. Let's remove it.
No details are provided when an error occurs during the parsing of an
errorfile, so it is a bit hard to diagnose where the problem is. Now, when it
happens, an
understandable error message is reported.
This patch is not a bug fix in itself. But it will be required to change a
fatal error into a warning in the latest stable releases. Thus it must be backported
as far as 2.0.
The default dh_param value is 2048 and it's preset to zero unless explicitly
set, so we must not report a warning about the DH param not being loadable in 1024
bits when we're going to use 2048. Thanks to Dinko for reporting this.
This should be backported to 2.2.
Since commit 37bafdcbb ("MINOR: sock_inet: move the IPv4/v6 transparent mode code
to sock_inet"), build options for transparent proxying are registered twice.
This patch removes the older one.
It does not require heavy deletion from the expr anymore, so we can now
turn this to a single-linked list since most of the time we want to delete
all instances of a given pattern from the head. By doing so we save 32 bytes
of memory per pattern. The pat_unlink_from_head() function was adjusted
accordingly.
Instead of using LIST_DEL() on the pattern itself inside an expression,
we look it up from its head. The goal is to get rid of the double-linked
list while this usage remains exclusively for freeing on startup error!
Instead of scanning all elements from the expression and using the
slow delete path there, let's use the faster way which involves
pat_delete_gen() while the elements are detached from their reference.
When purging all of a reference, it's much more efficient to scan the
reference patterns from the reference head and delete all derivative
patterns than to scan the expressions. The only thing is that we need
to proceed both for the current and next generations, in case there is
a huge gap between the two. With this, purging 20M IP addresses in small
batches of 100 takes roughly 3 seconds.
This function will be usable to purge at most a specified number of old
entries from a reference. Entries are declared old if their generation
number is in the past compared to the one passed in argument. This will
ease removal of early entries when new ones have been appended.
We also call malloc_trim() when available, at the end of the series,
because this is one place where there is a lot of memory to save. Reloads
of 1M IP addresses used in an ACL made the process grow up to 1.7 GB RSS
after 10 reloads and roughly stabilize there without this call, versus
only 260 MB when the call is present. Sadly there is no direct equivalent
for jemalloc, which stabilizes around 800MB-1GB.
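For illustration, a guarded call of this kind is enough (hedged sketch;
malloc_trim() is glibc-specific, hence the guard):

  #include <stdlib.h>
  #if defined(__GLIBC__)
  #include <malloc.h>
  #endif

  /* After freeing a large batch of entries, ask the allocator to return
   * free heap pages to the OS. No-op on libcs without malloc_trim(). */
  static void trim_memory(void)
  {
  #if defined(__GLIBC__)
          malloc_trim(0);   /* 0: keep no extra slack at the top of the heap */
  #endif
  }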
pat_ref_load() basically combines pat_ref_append() and pat_ref_commit().
It's very similar to pat_ref_add() except that it also allows to set the
generation ID and the line number. pat_ref_add() was modified to directly
rely on it to avoid code duplication. Note that a previous declaration
of pat_ref_load() was removed as it was just a leftover of an earlier
incarnation of something possibly similar, so no existing functionality
was changed here.
This function will be used after a successful pat_ref_append() to propagate
the pattern to all use places (including parsing and indexing). On failure,
it will entirely roll back all insertions and free the pattern itself. It
also preserves the generation number so that it is convenient for use in
association with pat_ref_append(). pat_ref_add() was modified to rely on
it instead of open-coding the insertion and roll-back.
Instead of matching any pattern found in the tree, only match those
matching the current generation of entries. This will make sure that
reloads are atomic, regardless of the time they take to complete, and
that newly added data are not matched until the whole reference is
committed. For consistency we proceed the same way on "show map" and
"show acl".
This will have no impact for now since generations are not used.
Right now it's not possible to perform a safe reload because we don't
know what patterns were recently added or were already present. This
patch adds a generation counter to the reference patterns so that it
is possible to know what generation of the reference they were loaded
with. A reference now has two generations, the current one, used for
all additions, and the next one, allocated to those wishing to update
the contents. The generation wraps at 2^32 so comparisons must be made
relative to the current position.
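Since the counter wraps, ordering must be computed as a signed distance from
the current generation rather than with a direct comparison; a minimal sketch:

  #include <stdint.h>

  /* True if generation <gen> was issued before <curr>, even across a wrap
   * at 2^32, as long as the two are less than 2^31 apart. */
  static inline int gen_is_older(uint32_t gen, uint32_t curr)
  {
          return (int32_t)(gen - curr) < 0;
  }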
The idea will be that upon full reload, the caller will first get a new
generation ID, will insert all new patterns using it, will then switch
the current ID to the new one, and will delete all entries older than
the current ID. This has the benefit of supporting chunked updates that
remain consistent and that won't block the whole process for ages like
pat_ref_reload() currently does.
Till now the only way to remove a known reference was via
pat_ref_delete_by_id() which scans the whole list to find a matching pointer.
Let's add pat_ref_delete_by_ptr() which takes a valid pointer. It can be
called by the function above after the pointer is found, and can also be
used to roll back a failed insertion much more efficiently.
These ones are not used anymore, so let's remove them to remove a bit
of the complexity. The ACL keyword's delete() function could be removed
as well, though most keyword declarations are positional and we have a
high risk of introducing a mistake here, so let's not touch the ACL part.
A few ACL keyword used to reference pat_delete_gen() as the deletion
function but this is not needed since it's the default one now. Let's
just remove this reference.
When we're removing an element under the expression lock, we don't need
anymore to run over all ->delete() functions via the expressions, since
we know that the single function does it fine now. Note that at this
point, pattern->delete() is not used at all through out the code anymore.
pat_del_tree_gen() was already chained onto pat_del_list_gen() to deal
with remaining cases, so let's complete the merge and have a generic
pattern deletion function acting on the reference and taking care of
reliably removing all elements.
This is the next step in speeding up entry removal. Now we don't scan
the whole lists or trees for elements pointing to the target reference,
instead we start from the reference and delete all linked patterns.
This simplifies some delete functions since we don't need anymore to
delete multiple times from an expression since all nodes appear after
the reference element. We can now have one generic list and one generic
tree deletion function.
This required the replacement of pattern_delete() with an open-coded
version since we now need to lock all expressions first before proceeding.
This means there is a high risk of lock inversion here but given that the
expressions are always scanned in the same order from the same head, this
must not happen.
Now deleting first entries is instantaneous, and it's still slow to
delete the last ones when looking up their ID since it still requires
to look them up by a full scan, but it's already way faster than
previously. Typically removing the last 10 IP from a 20M entries ACL
with a full-scan each took less than 2 seconds.
It would be technically possible to make use of indexed entries to
speed up most lookups for removal by value (e.g. IP addresses) but
that's for later.
There is a data model issue in the current pattern design that makes
pattern deletion extremely expensive: there's no direct way from a
reference to access all indexed occurrences. As such, the only way
to remove all indexed entries corresponding to a reference update
is to scan all expressions's lists and trees to find a link to the
reference. While this was possibly OK when map removal was not
common and most maps were small, this is not conceivable anymore
with GeoIP maps containing 10M+ entries and del-map operations that
are triggered from http-request rulesets.
This patch introduces two list heads from the pattern reference, one
for the objects linked by lists and one for those linked by tree node.
Ideally a single list would be enough but the linked elements are too
much unrelated to be distinguished at the moment, so we'll need two
lists. However for the long term a single-linked list will suffice but
for now it's not possible due to the way elements are removed from
expressions. As such this patch adds 32 bytes of memory usage per
reference plus 16 per indexed entry, but both will be cut in half
later.
The links are not yet used for deletion, this patch only ensures the
list is always consistent.
Now we have a single prune() function to act on an expression, and one
delete function for the lists and one for the trees. The presence of a
pointer in the lists is enough to warrant a free, and we rely on the
PAT_SF_REGFREE flag to decide whether to free using free() or regfree().
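The deletion path can then be fully generic; a standalone sketch of the idea
(types and field names are illustrative, not HAProxy's):

  #include <regex.h>
  #include <stdlib.h>

  #define SK_SF_REGFREE 0x1    /* illustrative stand-in for PAT_SF_REGFREE */

  struct pat_sketch {
          unsigned int sflags;
          regex_t regex;       /* valid when SK_SF_REGFREE is set */
          char *ptr;           /* valid otherwise */
  };

  /* One generic free routine: the flag recorded at indexing time tells
   * whether the payload is a compiled regex or a plain pointer. */
  static void pat_free_sketch(struct pat_sketch *p)
  {
          if (p->sflags & SK_SF_REGFREE)
                  regfree(&p->regex);
          else
                  free(p->ptr);
          free(p);
  }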
Currently we have no way to know how to delete/prune a pattern in a
generic way. A pattern doesn't contain its own type so we don't know
what function to call. Tree nodes are roughly OK but not lists where
regex are possible. Let's add one new bit for sflags at index time to
indicate that regex_free() will be needed upon deletion. It's not used
for now.
It's pointless to delete a backref and relink it to the next entry since
the next entry is going to do the exact same and so on until all of them
are deleted. Let's simply delete backrefs on reload.
It's not possible to uniquely update a single expression without updating
the pattern reference, I don't know why we've put the revision in the
expression back then, given that it in fact provides an update for a
full pattern. Let's move the revision into the reference's head instead.
This is one case where we may release large amounts of data at once. Tests
show that without this, after 10 full reloads of an ACL containing 1M IP
addresses, the memory usage grew and stabilized around 1.7 GB of RSS. With
this change, it stays around 260 MB and is stable across reloads.
This patch implements a couple of converters to validate and extract data from a
MQTT (Message Queuing Telemetry Transport) message. The validation consists of a
few checks as well as "packet size" validation. The extraction can get any field
from the variable header and the payload.
This is limited to CONNECT and CONNACK packet types only. All other messages are
considered invalid. It is not a problem for now because only the first packet
on each side can be parsed (CONNECT for the client and CONNACK for the server).
MQTT 3.1.1 and 5.0 are supported.
Reviewed and Fixed by Christopher Faulet <cfaulet@haproxy.com>
This patch implements a couple of converters to validate and extract tag value
from a FIX (Financial Information eXchange) message. The validation consists of
a few checks such as mandatory fields and checksum computation. The extraction
can get any tag value based on a tag string or tag id.
This patch requires the istend() function. Thus it depends on "MINOR: ist: Add
istend() function to return a pointer to the end of the string".
Reviewed and Fixed by Christopher Faulet <cfaulet@haproxy.com>
Let us use SSL_CTX_set1_curves_list, defined by OpenSSL, as well as in
openssl-compat when SSL_CTRL_SET_CURVES_LIST is present (BoringSSL),
for feature detection instead of version numbers.
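In practice this means testing the control constant, not the version; a hedged
sketch of the pattern:

  #include <openssl/ssl.h>

  /* Feature-based detection: SSL_CTRL_SET_CURVES_LIST being defined implies
   * SSL_CTX_set1_curves_list() is usable, including on BoringSSL which
   * otherwise reports itself as OpenSSL 1.1.1. */
  static int set_curves(SSL_CTX *ctx, const char *curves)
  {
  #if defined(SSL_CTRL_SET_CURVES_LIST)
          return SSL_CTX_set1_curves_list(ctx, curves);
  #else
          (void)ctx; (void)curves;
          return 0;   /* not supported by this SSL library */
  #endif
  }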
Following the patch b4daee ("MINOR: sock: add a check against cross
worker<->master socket activities"), this patch adds a dedicated applet
for the master CLI. It ensures that the CLI connection can't be
used with the master rights in the case of bugs.
In issue #933, @jaroslawr provided a report indicating that when using
many threads and many servers, it's very difficult to terminate the last
idle connections on each server. The issue has two causes in fact. The
first one is that during the calculation of the estimate of needed
connections, we round the computation up while in previous round it was
already rounded up, so we end up adding 1 to 1 which once divided by 2
remains 1. The second issue is that servers are not woken up anymore for
purging their connections if they don't have activity. The only reason
that was there to wake them up again was in case insufficient connections
were purged. And even then the purge task itself was not woken up. But
that is not enough for getting rid of the long tail of old connections
nor updating est_need_conns.
This patch makes sure to properly wake up as long as at least one idle
connection remains, and not to round up the needed connections anymore.
Prior to this patch, a test involving many connections which suddenly
stopped would keep many idle connections; now they're effectively halved
every pool-purge-delay.
This needs to be backported to 2.2.
Given that the previous issues caused spurious worker socket wakeups in
the master for inherited FDs that couldn't be closed, let's add a strict
test in the I/O callback to make sure that an accept() event is always
caught by the appropriate type of process (master for master listeners,
worker for worker listeners).
Since the h2 multiplexer no longer relies on the legacy HTTP representation, and
uses exclusively the HTX, the H1 parser state (h1m) is no longer used by the h2
streams. Thus it can be removed.
This patch may be backported as far as 2.1.
We used to refrain from calling fd_want_recv() if fd_updt was not allocated
but it's not the right solution as this does not allow the FD to be set.
Instead, let's use the new fd_want_recv_safe() which will update the FD and
create an update entry only if possible. In addition, the equivalent test
before calling fd_stop_recv() was removed as totally useless since there's
no fd_updt creation in this case.
In commit 374e9af35 ("MEDIUM: listener: let do_unbind_listener() decide
whether to close or not") it didn't appear necessary to have the master
process keep open the workers' inherited FDs. But this is actually
necessary to handle the reload on "bind fd@foo" situations, otherwise
the FD may be reassigned and the new socket cannot be set up, sometimes
causing "socket operation on non-socket" or other types of errors.
William found that this was the cause for the consistent failures of the
abns regtest, which already used to fail very often before this and was
as such marked as broken.
Interestingly I didn't have this issue with my test configs because
the FD number I used was higher and within the range of other listening
sockets. But this means that one of these wouldn't work as expected.
No backport is needed, this was introduced as part of the listeners
rework in 2.3.
It is not acceptable to suspend an inherited socket because we'd kill
its listening state, making it possibly unrecoverable for future
processes. The situation which can trigger this is when there is an
abns socket in a config and an inherited FD on another listener. Upon
soft reload, the abns fails to bind, a SIGTTOU is sent to the old
process which suspends everything, including the inherited FD, then
the new process can bind and tell the old one to quit. Except that the
new FD was not set back to the listen state, which is detected by
listener_accept() which can pause it. It's only upon second reload
that the FD works again.
The solution is to refrain from suspending such FDs since we don't own
them. And the next process will get them right anyway from its config.
For now only TCP and UDP face this issue so it's better to address this
on a protocol basis.
No backport is needed, this is related to the new listeners in 2.3.
The test on listener->state == LI_LISTEN is not sufficient to decide
if we need to enable a listener. Indeed, there is a very special case
which is the inherited FD shared, which has to reflect the real socket
state even after the previous test, and as such needs to remain in
LI_LISTEN state. In this case we don't want a worker to start the
master's listener nor conversely. Let's add a specific test for this.
An interesting case was reported with threads and moderately sized
stick-tables. Sometimes the watchdog would trigger during the purge.
It turns out that the stick tables were sized in the 10s of K entries
which is the order of magnitude of the possible number of connections,
and that threads were used over distinct NUMA nodes. While at first
glance nothing looks problematic there, actually there is a risk that
a thread trying to purge the table faces 100% of entries still in use
by a connection with (ts->ref_cnt > 0), and ends up scanning the whole
table, while other threads on the other NUMA node are causing the
cache lines to bounce back and forth and considerably slow down its
progress to the point of possibly spending hundreds of milliseconds
there, multiplied by the number of queued threads all failing on the
same point.
Interestingly, smaller tables would not trigger it because the scan
would be faster, and larger ones would not trigger it because plenty
of entries would be idle!
The most efficient solution is to increase the table size to be large
enough for this never to happen, but this is not reliable. We could
have a parallel list of idle entries but that would significantly
increase the storage and processing cost only to improve a few rare
corner cases.
This patch takes a more pragmatic approach, it considers that it will
not visit more than twice the number of nodes to be deleted, which
means that it accepts to fail up to 50% of the time. Given that very
small batches are programmed each time (1/256 of the table size), this
means the operation will finish quickly (128 times faster than now),
and will reduce the inter-thread contention. If this needs to be
reconsidered, it will probably mean that the batch size needs to be
fixed differently.
This needs to be backported to stable releases which extensively use
threads, typically 2.0.
Kudos to Nenad Merdanovic for figuring the root cause triggering this!
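A standalone sketch of the bounded scan described above (illustrative types;
the real code walks the table's ebtree under its lock):

  #include <stddef.h>

  struct entry { int ref_cnt; /* ... key, expiry, ... */ };

  /* Try to delete up to <budget> entries but visit at most 2 * <budget>,
   * so a pass over a table full of in-use entries ends quickly instead of
   * crawling everything under heavy inter-thread contention. */
  static size_t purge_batch(struct entry *tab, size_t len, size_t budget)
  {
          size_t visited = 0, deleted = 0;

          for (size_t i = 0; i < len; i++) {
                  if (deleted >= budget || visited >= 2 * budget)
                          break;
                  visited++;
                  if (tab[i].ref_cnt)     /* pinned by a connection: skip */
                          continue;
                  /* ... unlink tab[i] from the table here ... */
                  deleted++;
          }
          return deleted;
  }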
This partially reverts the patch 400829cd2 ("BUG/MEDIUM: filters: Don't try to
init filters for disabled proxies"). Disabled proxies must not be skipped in
flt_deinit() and flt_deinit_all_per_thread() when HAProxy is stopped because,
obviously, at this step, all proxies appear as disabled (or stopped, it is the
same state). It is safe to do so because, during startup, filters declared on
disabled proxies are removed. Thus they don't exist anymore during shutdown.
This patch must be backported in all versions where the patch above is.
When a TCP connection is upgraded to HTTP, the passthrough multiplexer owning
the client connection is destroyed and replaced by an HTTP multiplexer. When it
happens, the connection context is changed (it is in fact the mux itself). Thus,
when the mux-pt is destroyed, the connection is not released. But only the
connection must be kept. Everything else concerning the mux must be
released. Especially, the tasklet used for I/O subscriptions. In this part,
there was a bug and the tasklet was never released.
This patch should fix the issue #935. It must be backported as far as 2.0.
When servers based on server templates are initialized, the configuration file
and line are now copied. This helps to emit understandable warning and alert
messages.
This patch may be backported if needed, as far as 1.8.
On startup, if a server has no address but DNS resolution is configured, the
"none" method is added to the default init-addr methods, in addition to "last"
and "libc". Thus on startup, this server is set to RMAINT mode if no address is
found. It is only performed if no other init-addr method is configured.
Setting the RMAINT mode on startup is important to inhibit the health checks.
For instance, the following servers will now be set to RMAINT mode on startup:
server srv nofound.tld:80 check resolvers mydns
server srv _http._tcp.service.local check resolvers mydns
server-template srv 1-3 _http._tcp.service.local check resolvers mydns
while the following ones will trigger an error:
server srv nofound.tld:80 check
server srv nofound.tld:80 check resolvers mydns init-addr libc
server srv _http._tcp.service.local check
server srv _http._tcp.service.local check resolvers mydns init-addr libc
server-template srv 1-3 _http._tcp.service.local check resolvers mydns init-addr libc
This patch must be backported as far as 1.8.
When a health-check fails, if no connection attempt was performed, a socket
error must be reported. But this was only done if the connection was not
allocated. It must also be done if there is no control layer. Otherwise, an
L7TOUT will be reported instead.
It is possible to not have a control layer for a connection if the connection
address family is invalid or not defined.
This patch must be backported to 2.2.
Per-proxy and per-server post-check callback functions must be skipped for
disabled proxies because most of the configuration validity check is skipped for
these proxies.
This patch must be backported as far as 2.1.
Configuration is parsed for such proxies but not validated. Concretely, it means
check_config_validity() function does almost nothing for such proxies. Thus, we
must be careful to not initialize filters for disabled proxies because the check
callback function is not called. In fact, to be sure to avoid any trouble,
filters for disabled proxies are released.
This patch fixes a segfault at startup if the SPOE is configured for a disabled
proxy. It must be backported as far as 1.7 (maybe with some adaptations).
Let us use SSL_CTRL_GET_RAW_CIPHERLIST for feature detection instead
of version checks.
[wla: SSL_CTRL_GET_RAW_CIPHERLIST was introduced by OpenSSL commit
94a209 along with SSL_CIPHER_find. It was removed in boringSSL.]
Signed-off-by: William Lallemand <wlallemand@haproxy.org>
The code is horrible to work with because most functions are documented
with misleading comments resulting from many spelling and grammatical
mistakes, and plenty of remains of copy-paste mentioning arguments that
do not exist and return values that are never set. Too many hours wasted
writing non-working code because of assumptions resulting from this,
let's fix this once and for all now!
It's particularly difficult to make sure that the various pattern
structures are properly initialized given that they can be allocated
at multiple places and systematically via malloc() instead of calloc(),
thus not even leaving the possibility of default values. Let's adjust
a few of them.
It's more convenient to return the element than to return just 0 or 1,
as the next thing we'll want to do is to act on this element! In addition
it was using variable arguments instead of consts, causing some reuse
constraints which were also addressed. This doesn't change its use as
a boolean, hence why call places were not modified.
The maxage and smaxage variables were inadvertently assigned the
Cache-Control s-maxage and max-age values respectively when it should
have been the other way around.
This can be backported to all branches after 1.8 (included).
If an HTTP request or response had a "Cache-Control" header that had
multiple comma-separated subparts in its value (like "max-age=1,
no-store" for instance), we did not process the values correctly and
only parsed the first one. That made us store some HTTP responses in the
cache when they were explicitly uncacheable.
This patch replaces the way the values are parsed by an http_find_header
loop that manages every sub part of the value independently.
This patch should be backported to 2.2 and 2.1. The bug also exists on
previous versions but since the sources changed, a new commit will have
to be created.
[wla: This patch requires bb4582c ("MINOR: ist: Add a case insensitive
istmatch function"). Backporting for < 2.1 is not a requirement since it
works well enough for most cases, it was a known limitation of the
implementation of non-htx version too]
When no Cache-Control max-age or s-maxage information is present in a
cached response, we need to parse the Expires header value (RFC 7234#5.3).
An invalid Expires date value or a date earlier than the reception date
will make the cache_entry stale upon creation.
For now, the Cache-Control and Expires headers are parsed after the
insertion of the response in the cache so even if the parsing of the
Expires results in an already stale entry, the entry will exist in the
cache.
Memset the sample before using it through hlua_lua2smp. This function is
ORing the smp.flags, so this field needs to be cleared before its use.
This was reported by a coverity warning.
Fixes the github issue #929.
This bug can be backported up to 1.8.
Adjust condition used to report down_time for statistics. There was a
tiny probability of having a negative downtime if last_change was greater
than now. If this is the case, return only down_time.
This bug can be backported up to 1.8.
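A standalone sketch of the guarded computation:

  #include <time.h>

  /* Report a server's downtime: if last_change is in the future (e.g. clock
   * skew or not yet updated), return only the accumulated value rather than
   * adding a negative delta. */
  static unsigned int down_time(time_t now, time_t last_change,
                                unsigned int acc_down_time)
  {
          if (last_change > now)
                  return acc_down_time;
          return (unsigned int)(now - last_change) + acc_down_time;
  }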
When a server is up after a failure, its downtime was reset to 0 on the
statistics. This is due to a wrong condition that causes srv.down_time
to never be set. Fix this by updating down_time each time the server is in
STARTING state.
Fixes the github issue #920.
This bug can be backported up to 1.8.
Implement counters for h2 protocol error on connection or stream level.
Also count the total number of rst_stream and goaway frames sent by the
mux in response to a detected error.
Add pointer to counters as a member for h2c structure. This pointer is
initialized on h2_init function. This is useful to quickly access and
manipulate the counters inside every h2 function.
Res.cache_hit sample fetch returns a boolean which is true when the HTTP
response was built out of a cache. The cache's name is returned by the
res.cache_name sample_fetch.
This resolves GitHub issue #900.
If a client sends a conditional request containing an If-Modified-Since
header (and no If-None-Match header), we try to compare the date with
the one stored in the cache entry (coming either from a Last-Modified
head, or a Date header, or corresponding to the first response's
reception time). If the request's date is earlier than the stored one,
we send a "304 Not Modified" response back. Otherwise, the stored is sent
(through a 200 OK response).
This resolves GitHub issue #821.
In order to manage "If-Modified-Since" requests, we need to keep a
reference time for our cache entries (to which the conditional request's
date will be compared).
This reference is either extracted from the "Last-Modified" header, or
the "Date" header, or the reception time of the response (in decreasing
order of priority).
The date values are converted into seconds since epoch in order to ease
comparisons and to limit storage space.
BoringSSL pretends to be OpenSSL version 1.1.1. This messes up some
version-based feature presence checks, for example the OpenSSL-specific
early data support.
Let us change that feature detection to an SSL_READ_EARLY_DATA_SUCCESS
macro check instead of a version comparison.
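A hedged sketch of the resulting guard:

  #include <openssl/ssl.h>

  /* The early-data API defines SSL_READ_EARLY_DATA_SUCCESS itself, so its
   * presence is a reliable capability test even when BoringSSL fakes the
   * OpenSSL version number. HAVE_SSL_EARLY_DATA is an illustrative name. */
  #if defined(SSL_READ_EARLY_DATA_SUCCESS)
  #define HAVE_SSL_EARLY_DATA 1
  #endif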
Previous commit ae32ac74db ("BUG/MINOR: log: fix memory leak on logsrv
parse error") addressed one issue and introduced another one, the logsrv
pointer may also be null at the end of the function so we must test it
before deciding to dereference it.
This should be backported along with the patch above to 2.2.
In case of parsing error on logsrv, we can leave parse_logsrv() without
releasing logsrv->ring_name or smp_rgs. Let's free them on the error path.
This should fix issue #926 detected by Coverity.
The impact is only a tiny leak just before reporting a fatal error, so it
will essentially annoy valgrind.
This can be backported to 2.0 (just drop the ring part).
It's a regression from b3201a3e "BUG/MINOR: disable dynamic OCSP load
with BoringSSL". The original bug is linked to 76b4a12 "BUG/MEDIUM: ssl:
memory leak of ocsp data at SSL_CTX_free()": ssl_sock_free_ocsp()
should be in #ifndef OPENSSL_IS_BORINGSSL.
To avoid long #ifdef for small code, the BoringSSL part for ocsp load
is isolated in a simple #ifdef.
This must be backported in 2.2 and 2.1
`att_beg` is assigned to `next` at the end of the `for` loop, but is
assigned to `prev` at the beginning of the loop, which is itself
assigned to `next` after each loop. So it represents a double
assignment of the same value. Also `att_beg` is not used after the end
of the loop.
This is a partial fix for github issue #923, all the others could
probably be marked as intentional to protect future changes.
no backport needed.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
Issue #910 reports that we fail to check a few extchk_setenv() in the
child process. These are mostly harmless, but instead of counting on
the external check script to fail the dirty way, better fail cleanly
when detecting the failure.
This could probably be backported to all stable branches.
As reported by Coverity in issue #917, commit 96bca33 ("OPTIM: queue:
decrement the nbpend and totpend counters outside of the lock")
introduced a bug when moving the increments outside of the loop,
because we can't always rely on the pendconn "p" here as it may
be null. We can retrieve the proxy pointer directly from s->proxy
instead. The same is true for pendconn_redistribute(), though the
last "p" pointer there was still valid. This patch fixes both.
No backport is needed, this was introduced just before 2.3-dev8.
The "weight" column on the stats page is somewhat confusing when using
slowstart because it reports the effective weight, without being really
explicit about it. In some situations the user-configured weight is more
relevant (especially with long slowstarts where it's important to know
if the configured weight is correct).
This adds a new uweight stat which reports a server's user-configured
weight, and in a backend it receives the sum of all servers' uweights.
In addition it adds the mention of "effective" in a few descriptions
for the "weight" column (help and doc).
As a result, the list of servers in a backend is now always scanned
when dumping the stats. But this is not a problem given that these
servers are already scanned anyway and for way heavier processing.
In order to be compatible with the "set ssl cert" command of the CLI,
this patch restricts the ssl-load-extra-del-ext to files with a ".crt"
extension in the configuration.
Related to issue #785.
Should be backported where 8e8581e ("MINOR: ssl: 'ssl-load-extra-del-ext'
removes the certificate extension") was backported.
When dumping the stats page (or the CSV output), when many states are
mixed, it's hard to figure the number of up servers. But when showing
only the "up" servers or hiding the "maint" servers, there's no way to
know how many servers are configured, which is problematic when trying
to update server-templates.
What this patch does, for dumps in "up" or "no-maint" modes, is to add
after the backend's "UP" or "DOWN" state "(%d/%d)" indicating the number
of servers seen as UP to the total number of servers in the backend. As
such, seeing "UP (33/39)" immediately tells that there are 6 servers that
are not listed when using "up", or will let the client figure how many
servers are left once the number of non-maintenance ones is deducted. It's
not done on default dumps so as not to disturb existing tools, which
already have all the information they need in the dump.
"no-maint" is a bit similar to "up" except that it will only hide
servers that are in maintenance (or disabled in the configuration), and
not those that are enabled but failed a check. One benefit here is to
significantly reduce the output of the "show stat" command when using
large server-templates containing entries that are not yet provisioned.
Note that the prometheus exporter also has such an option which does
the exact same.
We already had it on the HTTP interface but it was not accessible on the
CLI. It can be very convenient to hide servers which are down, do not
resolve, or are in maintenance.
Leastconn has the nice property of being able to sort servers by their
current usage. It's really a shame to force all requests into the backend
queue when the algo would be able to also consider their current queue.
In order not to change existing behavior but extend it, this patch allows
leastconn to elect servers which are already full if they have an explicitly
configured maxqueue setting above zero and their queue hasn't reached that
threshold. This will significantly reduce the pressure in the backend queue
when queuing a lot with lots of servers.
A test on 8 threads with 100 servers configured with maxconn 1 jumped
from 165krps to 330krps with maxqueue 15 with this patch.
This partially undoes commit 82cd5c13a ("OPTIM: backend: skip LB when we
know the backend is full") but allows to scale much better even by setting
a single-digit maxqueue value. Some better heuristics could be used to
maintain the behavior of the bypass in the patch above, consisting in
keeping it if it's known that there is no server with a configured
maxqueue in the farm (or in the backend).
When servers are queued into the leastconn tree, it's important to also
consider their queue length. There could be some servers with lots of
queued requests that we don't want to hammer with extra connections. In
order not to add extra stress to the LB algorithm, we don't update the
value when adding to the queue, only when updating the connection count
(i.e. picking from the queue or releasing a connection). This will be
sufficient to significantly improve the fairness in such situations.
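A hedged sketch of the idea (the exact scaling in HAProxy's fwlc code differs;
SCALE is an illustrative constant):

  /* Sort key for leastconn: count both active and queued requests, scaled
   * by the effective weight so heavier servers sort earlier at equal load.
   * A zero weight sorts last. */
  #define SCALE 4096U

  static inline unsigned int fwlc_key_sketch(unsigned int served,
                                             unsigned int queued,
                                             unsigned int eweight)
  {
          return eweight ? (served + queued) * SCALE / eweight : ~0U;
  }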
We don't need to do that inside the lock. However since the operation
used to be done in deep functions, we have to make it resurface closer
to visible parts. It remains reasonably self-contained in queue.c so
that's not that big of a deal. Some places (redistribute) could benefit
from a single operation for all counts at once. Others like
pendconn_process_next_strm() are still called with both locks held but
now it will be possible to change this.
Instead of incrementing, decrementing them and updating their max under
the lock, make them atomic and keep them out of the lock as much as
possible. For __pendconn_unlink_* it would be wise to decide to move
these counters outside of the function, inside the callers so that a
single atomic op can be done per counter even for groups of operations.
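A standalone C11 sketch of maintaining a counter and its max with atomics
outside the lock (names are illustrative):

  #include <stdatomic.h>

  static _Atomic unsigned int nbpend;
  static _Atomic unsigned int nbpend_max;

  /* Increment the pending counter and raise the recorded max with a CAS
   * loop, all without holding the queue lock. */
  static void pendconn_count_add(void)
  {
          unsigned int n = atomic_fetch_add(&nbpend, 1) + 1;
          unsigned int old = atomic_load(&nbpend_max);

          while (n > old &&
                 !atomic_compare_exchange_weak(&nbpend_max, &old, n))
                  ;   /* <old> is reloaded by the failed CAS; retry */
  }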
Similarly to previous changes, we know if we're dealing with a server
or proxy lock so let's directly lock at the finest possible places
there. It's worth noting that a part of the operation consisting in
an increment and update of a max could be done outside of the lock
using atomic ops and a CAS.
The function is called with the lock held and does too many tests for
things that are already known from its callers. Let's split it in two
so that its callers call either the per-server or per-proxy function
depending on where the element is (since they had to determine it
prior to taking the lock).
No need to use an exclusive lock on the proxy anymore when reading its
setting, a read lock is enough. A few other places continue to use a
write-lock when modifying simple flags only in order to let this
function see a consistent value all along. This might be changed in
the future using barriers and local copies.
This is an anticipation of finer grained locking for the queues. For now
all lock places take a write lock so that there is no difference at all
with previous code.
In h2_send(), if we are in a state where we know it is no longer possible to
send data, we must exit the sending loop to avoid any possibility of looping
forever. It may happen if the mbuf ring is released while the H2_CF_MUX_MFULL
flag is still set. Here is a possible scenario to trigger the bug :
1) The mbuf ring is full because we are unable to send data. The
H2_CF_MUX_MFULL flag is set on the H2 connection.
2) At this stage, the task timeout expires because the H2 connection is
blocked. We enter in h2_timeout_task() function. Because the mbuf ring is
full, we cannot send the GOAWAY frame. Thus the H2_CF_GOAWAY_FAILED flag is
set. The H2 connection is not released yet because there is still a stream
attached. Here we leave h2_timeout_task() function.
3) A bit later, the H2 connection is woken up. In h2_process(), nothing is
performed by the first attempt to send data, in h2_send(). Then, because
the H2_CF_GOAWAY_FAILED flag is set, the mbuf ring is released. But the
H2_CF_MUX_MFULL flag is still there. At this step a second attempt to send
data is performed.
4) In h2_send(), we try to send data in a loop. To exit this loop, the done
   variable must be set to 1. Because the H2_CF_MUX_MFULL flag is set, we
don't call h2_process_mux() and done is not updated. Because the mbuf ring
is now empty, nothing is sent and the H2_CF_MUX_MFULL flag is never
removed. Now, we loop forever... waiting for the watchdog.
To fix the bug, we now exit the loop if one of these conditions is true :
- The H2_CF_GOAWAY_FAILED flag is set on the H2 connection
- The CO_FL_SOCK_WR_SH flag is set on the underlying connection
- The H2 connection is in the H2_CS_ERROR2 state
This patch should fix the issue #912 and most probably #875. It must be
backported as far as 1.8.
When an internal response is returned to a client, the message payload must be
skipped if it is a reply to a HEAD request. The payload is removed from the HTX
message just before the message forwarding.
This bug has been around for a long time. It was already there in the pre-HTX
versions. In legacy HTTP mode, internal errors are not parsed. So this bug
cannot be easily fixed. Thus, this patch should only be backported in all HTX
versions, as far as 2.0. However, the code has significantly changed in the
2.2. Thus in the 2.1 and 2.0, the patch must be entirely reworked.
Partial support of conditional HTTP requests. This commit adds the
support of the 'If-None-Match' header (see RFC 7232#3.2).
When a client specifies a list of ETags through one or more
'If-None-Match' headers, they are all compared to the one that might have
been stored in the corresponding http cache entry until one of them
matches.
If a match happens, a specific "304 Not Modified" response is
sent instead of the cached data. This response has all the stored
headers but no other data (see RFC 7232#4.1). Otherwise, the whole cached data
is sent.
Although unlikely in a GET/HEAD request, the "If-None-Match: *" syntax is
valid and also receives a "304 Not Modified" response (RFC 7234#4.3.2).
This resolves a part of GitHub issue #821.
When sent by a server for a given resource, the ETag header is
stored in the corresponding cache entry (as any other header). So in
order to perform future ETag comparisons (for subsequent conditional
HTTP requests), we keep the length of the ETag and its offset
relative to the start of the cache_entry.
If no ETag header exists, the length and offset are zero.
Add a function that compares two etags that might be of different types.
If any of them is weak, the 'W/' prefix is discarded and a strict string
comparison is performed.
Co-authored-by: Tim Duesterhus <tim@bastelstu.be>
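A standalone sketch of such a comparison:

  #include <string.h>

  /* Compare two entity tags: if either is weak ("W/" prefix), drop the
   * prefix and compare the remaining opaque strings byte by byte. */
  static int etag_match(const char *a, size_t alen, const char *b, size_t blen)
  {
          if (alen >= 2 && memcmp(a, "W/", 2) == 0) { a += 2; alen -= 2; }
          if (blen >= 2 && memcmp(b, "W/", 2) == 0) { b += 2; blen -= 2; }
          return alen == blen && memcmp(a, b, alen) == 0;
  }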
If the slowstart value in a state file implies the latest state change
is within the slowstart period, we end up calling srv_update_status()
to reschedule the server's state change but its task is not yet
allocated and remains null, causing a crash on startup.
Make sure srv_update_status() supports being called with partially
initialized servers which do not yet have a task. If the task has to
be scheduled, it will necessarily happen after initialization since
it will result from a state change.
This should be backported wherever server-state is present.
In commit 5cd4bbd7a ("BUG/MAJOR: threads/queue: Fix thread-safety issues
on the queues management") the counter of transferred connections was
accidentally lost, so that when a server goes down with connections in its
queue, it will always be reported that 0 connections were transferred.
This should be backported as far as 1.8 since the patch above was
backported there.
In issue #785, users are reporting that it's not convenient to load a
".crt.key" when the configuration contains a ".crt".
This option allows to remove the extension of the certificate before
trying to load any extra SSL file (.key, .ocsp, .sctl, .issuer etc.)
The patch changes a little bit the way ssl_sock_load_files_into_ckch()
looks for the file.
Safer to close the handle before the object is put back in the global pool.
This was introduced by commit 9378bbe0be ("MEDIUM: listener:
use protocol->accept_conn() to accept a connection")
This should fix github issue #902.
No backport needed.
Signed-off-by: William Dauchy <wdauchy@gmail.com>
As previously discussed, nbproc usage is bad, deprecated, and scheduled
for removal in 2.5.
If "nbproc" is found with more than one process while nbthread is not
set, a warning will be emitted encouraging to remove it or to migrate
to nbthread instead. This makes sure the user has an opportunity to
both see the message and silence it.
This counter is only updated and never used, and in addition it's done
without any atomicity so it's very unlikely to be correct on multi-CPU
systems! Let's just remove it since it's not used.
It's a bit overkill to register an initcall to call a function to set
a lock to zero when not debugging, let's just declare the lock as
pre-initialized to zero.
When using a low hash-balance-factor value, it's possible to loop
many times trying to find the best server. Figures in the order of
100-300 times were observed for 1000 servers with a factor of 101
(which seems a bit excessive for such a large farm). Given that
there's nothing in that function that prevents multiple threads
from working in parallel, let's switch to a read lock. Tests on
8 threads show roughly a 2% performance increase with this.
The "first" algorithm creates a lot of contention because all threads
focus on the same server by definition (the first available one). By
turning the exclusive lock to a read lock in fas_get_next_server(),
the request rate increases by 16% for 8 threads when many servers are
getting close to their maxconn.
This function doesn't change the tree, it only looks for the first
usable server, so let's do that under a read lock to limit the
situations like the ones described in issue #881 where finding a
usable server when dealing with lots of saturated ones can be
expensive. At least threads will now be able to look up in
parallel.
It's interesting to note that s->served is not incremented during the
server choice, nor is the server repositioned. So right now already,
nothing prevents multiple threads from picking the same server. This
will not cause a significant imbalance anyway given that the server
will automatically be repositioned at the right place, but this might
be something to improve in the future if it doesn't come with too high
a cost.
It also looks like the way a server's weight is updated could be
revisited so that the write lock gets tighter at the expense of a
short part of inconsistency between weights and servers still present
in the tree.
- map_get_server_hash() doesn't need a write lock since it only
reads the array, let's only use a read lock here.
- map_get_server_rr() only needs exclusivity to adjust the rr_idx
while looking for its entry. Since this one is not used by
map_get_server_hash(), let's turn this lock to a seek lock that
doesn't block reads.
With 8 threads, no significant performance difference was noticed
given that lookups are usually instant with this LB algo so the
lock contention is rare.
It was previously a spinlock, and it happens that a number of LB algos
only lock it for lookups, without performing any modification. Let's
first turn it to an rwlock and w-lock it everywhere. This is strictly
identical.
It was carefully checked that every HA_SPIN_LOCK() was turned to
HA_RWLOCK_WRLOCK() and that HA_SPIN_UNLOCK() was turned to
HA_RWLOCK_WRUNLOCK() on this lock. _INIT and _DESTROY were updated too.
The server lock must be held when server_take_conn() and server_drop_conn()
lbprm callback functions are called. It is a documented prerequisite but it is
not always performed. It only affects the leastconn and fas LB algorithms.
Others don't use these callback functions.
A race condition on the next pending effective weight (next_eweight) may be
encountered with the leastconn lb algorithm. An agent check may set it to 0
while fwlc_srv_reposition() is called. The server is locked during the
next_eweight update. But because the server lock is not acquired when
fwlc_srv_reposition() is called, we may use it to recompute the server key,
leading to a division by 0.
This patch must be backported as far as 1.8.
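As an illustration, the fix boils down to holding the server lock around
these callbacks; a minimal sketch (the lock label and the lbprm callback
access are shown as assumptions about the surrounding code):

    /* The server lock must be held so that next_eweight cannot be zeroed
     * by an agent check while fwlc_srv_reposition() recomputes the key.
     */
    HA_SPIN_LOCK(SERVER_LOCK, &srv->lock);
    if (srv->proxy->lbprm.server_drop_conn)
        srv->proxy->lbprm.server_drop_conn(srv);
    HA_SPIN_UNLOCK(SERVER_LOCK, &srv->lock);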
It is not guaranteed that the backend connection has an owner. It is set when
the connection is created. But when the connection is moved in a server idle
list, the connection owner is set to NULL and may never be set again. On the
other hand, when a mux is created or when a CS is attached, the session is
always defined. The H1 stream always keeps a reference on it when it is
created. Thus, when a bad message is captured we should not rely on the
connection owner to retrieve the session. Instead we should get it from the H1
stream.
If an agent tries to set a variable with the NULL data type, an unset is
performed instead to avoid undefined behavior. Once decoded, such data are
translated to
a sample with the type SMP_T_ANY. It is unexpected in HAProxy. When a variable
is set with such sample, no data are attached to the variable. Thus, when the
variable is retrieved later in the transaction, the sample data are
uninitialized, leading to undefined behaviors depending on how it is used. For
instance, it leads to a crash if the debug converter is used on such variable.
This patch should fix the issue #855. It must be backported as far as 1.8.
Detect if the sni used a constant value and, if so, allow this connection
to be reused for later sessions. Use a combination of SMP_USE_INTRN +
!SMP_F_VOLATILE to consider a sample as a constant value.
This feature was requested in GitHub issue #371.
During a peers session collision (two peer sessions opened on both sides), we
must mark the peer whose session will be shut down as alive, otherwise the
->reconnect timer will be set with a wrong value if the synchro task expires
after the peer has been reconnected. This possibly leads to unexpected
disconnections during handshakes.
Furthermore, this patch cancels any heartbeat transmission when a
reconnection is prepared.
Right now when running a configuration with many global timers (e.g. many
health checks), there is a lot of contention on the global wait queue
lock because all threads queue up in front of it to scan it.
With 2000 servers checked every 10 milliseconds (200k checks per second),
after 23 seconds running on 8 threads, the lock stats were this high:
  Stats about Lock TASK_WQ:
      write lock  : 9872564
      write unlock: 9872564 (0)
      wait time for write     : 9208.409 msec
      wait time for write/lock: 932.727 nsec
      read lock   : 240367
      read unlock : 240367 (0)
      wait time for read      : 149.025 msec
      wait time for read/lock : 619.991 nsec
i.e. ~5% of the total runtime spent waiting on this specific lock.
With upgradable locks we don't need to work like this anymore. We
can just try to upgrade the read lock to a seek lock before scanning
the queue, then upgrade the seek lock to a write lock for each element
we want to delete there and immediately downgrade it to a seek lock.
The benefit is double:
- all other threads which need to call next_expired_task() before
polling won't wait anymore since the seek lock is compatible with
the read lock ;
- all other threads competing on trying to grab this lock will fail
on the upgrade attempt from read to seek, and will let the current
lock owner finish collecting expired entries.
Doing only this has reduced the wake_expired_tasks() CPU usage in a
very large servers test from 2.15% to 1.04% as reported by perf top,
and increased by 3% the health check rate (all threads being saturated).
This is expected to help against (and possibly solve) the problem
described in issue #875.
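A rough sketch of the resulting locking ladder (macro names from the
rwlock API; the lock label and the helper are simplifying assumptions):

    HA_RWLOCK_RDLOCK(TASK_WQ_LOCK, &wq_lock);
    if (HA_RWLOCK_TRYRDTOSK(TASK_WQ_LOCK, &wq_lock) != 0) {
        /* upgrade failed: another thread is already scanning the queue */
        HA_RWLOCK_RDUNLOCK(TASK_WQ_LOCK, &wq_lock);
        return;
    }
    while ((task = wq_next_expired())) {           /* hypothetical helper */
        HA_RWLOCK_SKTOWR(TASK_WQ_LOCK, &wq_lock);  /* write only to unlink */
        __task_unlink_wq(task);
        HA_RWLOCK_WRTOSK(TASK_WQ_LOCK, &wq_lock);  /* readers may pass again */
        task_wakeup(task, TASK_WOKEN_TIMER);
    }
    HA_RWLOCK_SKUNLOCK(TASK_WQ_LOCK, &wq_lock);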
There is a theoretical problem in the wait queue, which is that with many
threads, one could spend a lot of time looping on the newly expired tasks,
causing a lot of contention on the global wq_lock and on the global
rq_lock. This initially sounds benign, but if another thread does just
a task_schedule() or task_queue(), it might end up waiting for a long
time on this lock, and this wait time will count on its execution budget,
degrading the end user's experience and possibly risking to trigger the
watchdog if that lasts too long.
The simplest (and backportable) solution here consists in bounding the
number of expired tasks that may be picked from the global wait queue at
once by a thread, given that all other ones will do it as well anyway.
We don't need to pick more than global.tune.runqueue_depth tasks at once
as we won't process more, so this counter is updated for both the local
and the global queues: threads with more local expired tasks will pick
less global tasks and conversely, keeping the load balanced between all
threads. This will guarantee a much lower latency if/when wakeup storms
happen (e.g. hundreds of thousands of synchronized health checks).
Note that some crashes have been witnessed with 1/4 of the threads in
wake_expired_tasks() and, while the issue might or might not be related,
not having reasonable bounds here definitely justifies why we can spend
so much time there.
This patch should be backported, probably as far as 2.0 (maybe with
some adaptations).
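The bound can be pictured as a single budget shared between both queues;
a sketch with illustrative helper names:

    int budget = global.tune.runqueue_depth;
    struct task *task;

    /* local expired tasks consume the budget first... */
    while (budget && (task = next_local_expired())) {    /* hypothetical */
        task_wakeup(task, TASK_WOKEN_TIMER);
        budget--;
    }
    /* ...so threads that are busy locally pick fewer global ones */
    while (budget && (task = next_global_expired())) {   /* hypothetical */
        task_wakeup(task, TASK_WOKEN_TIMER);
        budget--;
    }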
The proxy stopping mechanism was changed with commit 322b9b94e ("MEDIUM:
proxy: make stop_proxy() now use stop_listener()") so that it's now
entirely driven by the listeners. One thing was forgotten though, which
is that pure backends will not stop anymore since they don't have any
listener, and that it's necessary to stop them in order to stop the
health checks.
No backport is needed.
We don't need to specify the handler anymore since it's set in the
receiver. Let's remove this argument from the function and clean up
the remains of code that were still setting it.
Now we define a new sock_accept_iocb() for socket-based stream protocols
and use it as a wrapper for listener_accept() which now takes a listener
and not an FD anymore. This will allow the receiver's I/O cb to be
redefined during registration, and more specifically to get rid of the
hard-coded hacks in protocol_bind_all() made for syslog.
The previous ->accept() callback in the protocol was removed since it
doesn't have anything to do with accept() anymore but is more generic.
A few places where listener_accept() was compared against the FD's IO
callback for debugging purposes on the CLI were updated.
For now we're still using the protocol's default accept() function as
the I/O callback registered by the receiver into the poller. While
this is usable for most TCP connections where a listener is needed,
this is not suitable for UDP where a different handler is needed.
Let's make this configurable in the receiver just like the upper layer
is configurable for listeners. In order to ease stream protocols
handling, the protocols will now provide a default I/O callback
which will be preset into the receivers upon allocation so that
almost none of them has to deal with it.
The receiver FDs must not be manipulated by the listener_accept()
function anymore, it must exclusively rely on the job performed by
its listeners, as it is also the only way to keep the receivers
working for established connections regardless of the listener's
state (typically for multiplexed protocols like QUIC). This used
to be necessary when the FDs were adjusted at once only but now
that fd_done() is gone and the need for polling enabled by the
accept_conn() function which detects the EAGAIN, we have nothing
to do there to fixup any possible previous bad decision anymore.
Interestingly, as a side effect of making the code not depend on
the FD anymore, it also removes the need for a second lock, which
increases the accept rate by about 1% on 8 threads.
Now listener_accept() doesn't have to deal with the incoming FD anymore
(except for a little bit of side band stuff). It directly retrieves a
valid connection from the protocol layer, or receives a well-defined
error code that helps it decide how to proceed. This removes a lot of
hardly maintainable low-level code and opens the function to receive
new protocol stacks.
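The calling convention can be sketched like this (the call site and the
status codes are sketched from the description; exact names are
assumptions):

    struct connection *conn;
    int status;

    conn = l->rx.proto->accept_conn(l, &status);
    if (conn) {
        /* valid connection: hand it to the upper layers */
    }
    else switch (status) {
        case CO_AC_DONE:    break;  /* nothing left to accept          */
        case CO_AC_RETRY:   break;  /* transient error, try again      */
        case CO_AC_YIELD:   break;  /* accepted enough, yield for now  */
        case CO_AC_PAUSE:   break;  /* limit reached, pause accepting  */
        case CO_AC_PERMERR: break;  /* permanent listener error        */
    }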
This is the same as previous commit, but this time for the sockpair-
specific stuff, relying on recv_fd_uxst() instead of accept(), so the
code is simpler. The various errno cases are handled like for regular
sockets, though some of them will probably never happen, but this does
not hurt.
The socket-specific accept() code in listener_accept() has nothing to
do there. Let's move it to sock.c where it can be significantly cleaned
up. It will now directly return an accepted connection and provide a
status code instead of letting listener_accept() deal with various errno
values. Note that this doesn't support the sockpair specific code.
The function is now responsible for dealing with its own receiver's
polling state and calling fd_cant_recv() when facing EAGAIN.
One tiny change from the previous implementation is that the connection's
sockaddr is now allocated before trying accept(), which saves a memcpy()
of the resulting address for each accept at the expense of a cheap
pool_alloc/pool_free on the final accept returning EAGAIN. This still
apparently slightly improves accept performance in microbenchmarks.
This call was introduced by commit 5ced3e887 ("MINOR: sock: add
sock_accept_conn() to test a listening socket") but is actually quite
confusing because it makes one think the socket will accept a connection
(which is what we want to have in a new function) while it only tells
whether it's configured to accept connections. Let's call it
sock_accepting_conn() instead.
The same change was applied to sockpair which had the same issue.
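For reference, the test boils down to the standard SO_ACCEPTCONN socket
option; a simplified, fd-based sketch:

    #include <sys/socket.h>

    /* returns >0 if <fd> is currently listening, 0 if not, -1 on error */
    static int sock_accepting_conn_fd(int fd)
    {
        int val = 0;
        socklen_t len = sizeof(val);

        if (getsockopt(fd, SOL_SOCKET, SO_ACCEPTCONN, &val, &len) == -1)
            return -1;
        return val;
    }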
Now that this function is always called with an initialized connection
and that the control layer is always initialized, we don't need to play
games with fdtab[] to decide how to close, we can simply rely on the
regular close path using conn_ctrl_close(), which can be fused with
conn_xprt_close() into conn_full_close().
The code is cleaner because the FD is now used only for some
protocol-specific setup (that will eventually have to move) and to
try to send a hard-coded HTTP 500 error message on raw sockets.
Till now we would keep a per-thread queue of pending incoming connections
for which we would store:
- the listener
- the accepted FD
- the source address
- the source address' length
And these elements were first used in session_accept_fd() running on the
target thread to allocate a connection and duplicate them again. Doing
this induces various problems. The first one is that session_accept_fd()
may only run on file descriptors and cannot be reused for QUIC. The second
issue is that it induces lots of memory copies and that the listener
queue thrashes a lot of cache, consuming 64 bytes per entry.
This patch changes this by allocating the connection before queueing it,
and by only placing the connection's pointer into the queue. Indeed, the
first two calls used to initialize the connection already store all the
information above, which can be retrieved from the connection pointer
alone. So we just have to pop one pointer from the target thread, and
pass it to session_accept_fd() which only needs the FD for the final
settings.
This starts to make the accept path a bit more transport-agnostic, and
saves memory and CPU cycles at the same time (1% connection rate increase
was noticed with 4 threads). Thanks to dividing the accept-queue entry
size from 64 to 8 bytes, its size could be increased from 256 to 1024
connections while still dividing the overall size by two. No single
queue full condition was met.
One minor drawback is that a connection may be allocated from one thread's
pool to be used by another one. But this already happens a lot with
connection reuse so there is really nothing new here.
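The resulting per-thread queue is then little more than a ring of
pointers; an approximation of the real structure:

    /* one 8-byte slot per pending connection instead of 64 bytes */
    struct accept_queue_ring {
        unsigned int head;
        unsigned int tail;
        struct connection *entry[1024];   /* grown from 256 entries */
    };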
Roughly half of the calls to sockaddr_alloc() are made to copy an already
known address. Let's optionally pass it in argument so that the function
can handle the copy at the same time, this slightly simplifies its usage.
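A minimal sketch of the resulting function (the pool name is an
assumption):

    struct sockaddr_storage *sockaddr_alloc(struct sockaddr_storage **sap,
                                            const struct sockaddr_storage *orig,
                                            socklen_t len)
    {
        struct sockaddr_storage *sa = pool_alloc(pool_head_sockaddr);

        if (sa && orig)
            memcpy(sa, orig, len);   /* optional copy of a known address */
        *sap = sa;
        return sa;
    }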
fd_done_recv() used to be useful with the FD cache because it used to
allow to keep a file descriptor active in the poller without being
marked as ready in the cache, saving it from ringing immediately,
without incurring any system call. It was a way to make it yield
to wait for new events leaving a bit of time for others. The only
user left was the connection accepter (listener_accept()). We used
to suspect that with the FD cache removal it had become totally
useless since changing its readiness or not wouldn't change its
status regarding the poller itself, which would be the only one
deciding to report it again.
Careful tests showed that it indeed has exactly zero effect nowadays,
the syscall numbers are exactly the same with and without, including
when enabling edge-triggered polling.
Given that there's no more API available to manipulate it and that it
was directly called as an optimization from listener_accept(), it's
about time to remove it.
No protocol defines it anymore. The last user used to be the monitor-net
stuff that got partially broken already when the tcp_drain() function
moved to conn_sock_drain() with commit e215bba95 ("MINOR: connection:
make conn_sock_drain() work for all socket families") in 1.9-dev2.
A part of this will surely move back later when non-socket connections
arrive with QUIC but better keep the API clean and implement what's
needed in time instead.
As discussed here during 2.1-dev, "monitor-net" is totally obsolete:
https://www.mail-archive.com/haproxy@formilux.org/msg35204.html
It's fundamentally incompatible with usage of SSL, and imposes the
presence of file descriptors with hard-coded syscalls directly in the
generic accept path.
It's very unlikely that anyone has used it in the last 10 years for
anything beyond testing. In the worst case, should anyone depend
on it, replacing it with "http-request return status 200 if ..." and
"mode http" would certainly do the trick.
The keyword is still detected as special by the config parser to help
users update their configurations appropriately.
As discussed here during 2.1-dev, "mode health" is totally obsolete:
https://www.mail-archive.com/haproxy@formilux.org/msg35204.html
It's fundamentally incompatible with usage of SSL, doesn't support
source filtering, and imposes the presence of file descriptors with
hard-coded syscalls directly in the generic accept path.
It's very unlikely that anyone has used it in the last 10 years for
anything beyond testing. In the worst case, should anyone depend
on it, replacing it with "http-request return status 200" and "mode
http" would certainly do the trick.
The keyword is still detected as special by the config parser to help
users update their configurations appropriately.
The FCGI mux is marked with HOL blocking. In safe reuse mode, connections
using it are placed on the session instead of the available lists to
avoid sharing them with several clients. On detach, if there are no more
streams, remove the connection from the session before adding it to the
idle list. If there are still streams in use, do not add it to the
available list as it should already be on the session list.
The H2 mux is marked with HOL blocking. In safe reuse mode, connections
using it are placed on the session instead of the available lists to
avoid sharing them with several clients. On detach, if there are no more
streams, remove the connection from the session before adding it to the
idle list. If there are still streams in use, do not add it to the
available list as it should already be on the session list.
If a connection is using a mux protocol subject to HOL blocking, add it
to the session instead of the available list to avoid sharing it with
other clients on connection reuse.
When allocating a new session in connect_server(), if the mux protocol is
marked as subject to HOL blocking, add the connection to the session instead
of the available list to avoid sharing it with other clients.
On server connection migration from one thread to another, the wrong
idle thread-specific counter is decremented. This bug was introduced
by commit 3d52f0f1f8 due to the
factorization with srv_use_idle_conn. However, this statement is only
executed from conn_backend_get. Extract the decrement from
srv_use_idle_conn in conn_backend_get and use the correct
thread-specific counter.
Rename the function to srv_use_conn to better reflect its purpose as it
is also used with a newly initialized connection not in the idle list.
As a side change, the connection insertion to available list has also
been extracted to conn_backend_get. This will be useful to be able to
specify an alternative list for protocol subject to HOL risk that should
not be shared between several clients.
This bug is only present in this release and thus does not need a backport.
The loop always missed one iteration due to the increment being done in
the for-loop's condition check. Move the increment to the loop's last
statement to fix this behaviour.
This bug has a very limited impact, not at all visible to the user, but
could be backported to 2.2.
When running a pure config check (haproxy -c) we go through the deinit
phase without having allocated fdtab, so we can't blindly dereference
it. The issue was added by recent commit ae7bc4a23 ("MEDIUM: deinit:
close all receivers/listeners before scanning proxies"), no backport is
needed.
In issue #894, Coverity suspects uninitialized values for a socket's
address whose family is AF_UNSPEC but it doesn't know that the address
is not used in this case. It's not on a critical path and working around
it is trivial, let's fully declare the address. We're doing it for both
TCP and UDP, because the same principle appears at two places.
It may happen that during a temporary listener pause resulting from a
SIGTTOU, one process gets one of its sockets disabled by another process
and will not be able to recover from this situation by itself. For the
protocols supporting this (TCPv4 and TCPv6 at the moment) this situation
is detectable, so when this happens, let's put the listener into the
PAUSED state so that it remains consistent with the real socket state.
One nice effect is that just sending the SIGTTIN signal to the process
is enough to recover the socket in this case.
There is no need to backport this, this behavior has been there forever
and the fix requires to reimplement the getsockopt() call there.
For socket pairs we don't rely on a real listening socket but we need to
have a properly connected UNIX stream socket. This is what the new
sockpair_accept_conn() tries to report. Some corner cases like half
shutdown will still not be detected but that should be sufficient for
most cases we really care about.
Now we introduce a new .rx_listening() function to report if a receiver is
actually a listening socket. The reason for this is to help detect shared
sockets that might have been broken by sibling processes.
At several places we need to check if a socket is still valid and still
willing to accept connections. Instead of open-coding this each time,
let's add a new function for this.
Currently the suspend/resume mechanism for listeners only works on Linux
and we resort to a number of tricks involving shutdown+listen+shutdown
to try to detect failures on other operating systems that do not support
it. But on Linux connect(AF_UNSPEC) also works pretty well and is much
cleaner. It still doesn't work on other operating systems but the error
is easier to detect and appears safer. So let's switch to this.
When starting with a huge maxconn (say 1 billion), the only error seen
is "No polling mechanism available". This doesn't help at all to resolve
the problem. Let's add specific alerts for the failed mallocs. Now we can
get this instead:
[ALERT] 286/154439 (23408) : Not enough memory to allocate 2000000033 entries for fdtab!
This may be backported as far as 2.0 as it helps debugging bad configurations.
There are reports of a few "SC" in logs during reloads when H2 is used
on the backend side. Christopher analysed this as being caused by the
proxy disabled test in h2_process(). As the comment says, this was done
for frontends only, and must absolutely not send a GOAWAY to the backend,
as all it will result in is to make newly queued streams fail.
The fix consists in simply testing the connection side before deciding
to send the GOAWAY.
This may be backported as far as 2.0, though for whatever reason it seems
to manifest itself only since 2.2 (probably due to changes in the outgoing
connection setup sequence).
On some operating systems, RLIM_INFINITY is set to -1 so that when the
hard limit on the number of FDs is set to unlimited, taking the MAX
of both values keeps rlim_fd_cur and everything works. But on other
systems this value is defined as the highest positive integer. This
is what was observed on a 32-bit AIX 5.1. The effect is that maxsock
becomes 2^31-1 and that fdtab allocation fails.
Note that a simple workaround consists in manually setting maxconn in
the global section.
Let's ignore unlimited as soon as we retrieve rlim_fd_max so that all
systems behave consistently.
This may be backported as far as 2.0, though it doesn't seem like it
has annoyed anyone.
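The fix amounts to normalizing the value right where it is retrieved; a
standalone sketch of the idea:

    #include <sys/resource.h>

    struct rlimit limit = { };
    rlim_t rlim_fd_max_at_boot = 0;

    if (getrlimit(RLIMIT_NOFILE, &limit) == 0) {
        rlim_fd_max_at_boot = limit.rlim_max;
        if (rlim_fd_max_at_boot == RLIM_INFINITY)
            rlim_fd_max_at_boot = 0;   /* treat "unlimited" as not set */
    }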
This patch adds a new "coll" counter and the heartbeat timer values to the
"show peers" command output. It also adds the elapsed time since the last
handshake in a new "last_hdshk" peer dump field.
When factoring out the pause/resume error messages in commit 775e00158
("MAJOR: signals: use protocol_pause_all() and protocol_resume_all()")
I forgot that ha_warning() and send_log() take a format string and not
just a const string. No backport is needed, this is 2.3-dev.
This one was scheduled for removal in 2.3 since 2.2-dev3 by commit
1b85785bc ("MINOR: config: mark global.debug as deprecated"). Let's
remove it now. It remains totally possible to use -d on the command
line though.
This was introduced 15 years ago or so to delay the stopping of some
services so that a monitoring device could detect its port being down
before services were stopped. Since then, clean reloads were implemented
and this doesn't cope well with reload at all, preventing the new process
from seamlessly binding, and forcing processes to coexist with half-baked
configurations.
Now it has become a real problem because there's a significant code
portion in the proxies that is solely dedicated to this obsolete feature,
and dealing with its special cases eases the introduction of bugs in
other places so it's about time that it goes.
We could tentatively schedule its removal for 2.4 with a hard deadline
for 2.5 in any case.
Now we have ->suspend() and ->resume() for listeners at the protocol
level. This means that it now becomes possible for a protocol to redefine
its own way to suspend and resume. The default functions are provided for
TCP, UDP and unix, and they are pass-through to the receiver equivalent
as it used to be till now. Nothing was defined for sockpair since it does
not need to suspend/resume during reloads, hence it will succeed.
The inner part now goes into the protocol and is used to decide how to
unbind a given protocol's listener. The existing code which is able to
also unbind the receiver was provided as a default function that we
currently use everywhere. Some complex listeners like QUIC will use this
to decide how to unbind without impacting existing connections, possibly
by setting up other incoming paths for the traffic.
This is used as a generic way to unbind a receiver at the end of
do_unbind_listener(). This allows to considerably simplify that function
since we can now let the protocol perform the cleanup. The generic code
was moved to sock.c, along with the conditional rx_disable() call. Now
the code also supports that the ->disable() function of the protocol
which acts on the listener performs the close itself and adjusts the
RX_F_BOUND flag accordingly.
This listener flag indicates whether the receiver part of the listener
is specific to the master or to the workers. In practice it's only used
by the master's CLI right now. It's used to know whether or not the FD
must be closed before forking the workers. For this reason it's way more
of a receiver's property than a listener's property, so let's move it
there under the name RX_F_MWORKER. The rest of the code remains
unchanged.
And also remove it from its callers. This subtle distinction was added as
sort of a hack for the seamless reload feature but is not needed anymore
since do_close became unused with the previous commit ("MEDIUM:
listener: let do_unbind_listener() decide whether to close or not").
This also removes the unbind_listener_no_close() function.
The listener contains all the information needed to decide to close on
unbind or not. The rule is the following (when we're not stopping):
- worker process unbinding from a worker's FD with socket transfer enabled => keep
- master process unbinding from a master's inherited FD => keep
- master process unbinding from a master's FD => close
- master process unbinding from a worker's FD => close
- worker process unbinding from a master's FD => close
- worker process unbinding from a worker's FD => close
Let's translate that into the function and stop using the do_close
argument that is a bit obscure for callers. It was not yet removed
to ease code testing.
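Condensed into code, the rules above look roughly like this (flag and
global variable names are assumptions):

    /* returns non-zero when unbinding must also close the FD */
    static int listener_must_close(const struct listener *l)
    {
        if (!stopping) {
            if (!master && !(l->rx.flags & RX_F_MWORKER) &&
                (global.tune.options & GTUNE_SOCKET_TRANSFER))
                return 0;   /* worker keeps worker FDs for seamless reload */
            if (master && (l->rx.flags & RX_F_MWORKER) &&
                (l->rx.flags & RX_F_INHERITED))
                return 0;   /* master keeps its inherited FDs */
        }
        return 1;           /* all other combinations close */
    }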
BROKEN: the failure rate on reg-tests/seamless-reload/abns_socket.vtc has
significantly increased for no obvious reason. It fails 99% of the time vs
10% before.
do_unbind_listener() is not logical and is not even idempotent. It must
not touch the fd if already -1, which also means not touch the receiver.
In addition, when performing a partial stop on a socket (not closing),
we know the socket remains in the listening state yet it's marked as
LI_ASSIGNED, which is confusing as it doesn't translate its real state.
With this change, we make sure that FDs marked for close end up in
ASSIGNED state and that those which are really bound and on which a
listen() was made (i.e. not pause) remain in LISTEN state. This is what
is closest to reality.
Ideally this function should become a default proto->unbind() one but
it may still keep a bit too much state logic to become generalized to
other protocols (e.g. QUIC).
Right now in enable_listener(), we used to start all enabled
listeners then kill from the workers those that were for the master.
But this is incomplete. We must also close from the master the
listeners that are solely for workers, and do it before we even
start them. Otherwise we end up with the master responding to the
worker CLI connections, since the listener now remains in listening
mode to reflect the socket's real state.
It doesn't seem like it could have caused bugs in the past because we
used to aggressively mark disabled listeners as LI_ASSIGNED despite
the fact that they were still bound and listening. If this patch were
ever seen as a candidate solution for any obscure bug, be careful in
that it subtly relies on the fact that fd_delete() doesn't close
inherited FDs anymore, otherwise that could break the master's ability
to pass inherited FDs on reloads.
In Linux kernel's net/ipv4/udp.c there's a udp_disconnect() function
which is called when connecting to AF_UNSPEC, and which unhashes a
"connection". This property, which is also documented in connect(2)
both in Linux and Open Group's man pages for datagrams, is interesting
because it allows reversing a connect(), which is in fact a filter on
the source. As such we can suspend a receiver by making it connect to
itself, which will cause it not to receive any traffic anymore, letting
a new one receive it all, then resume it by breaking this connection.
This was tested to work well on Linux, other operating systems should
also be tested. Before this, sending a SIGTTOU to a process having a
UDP syslog forwarder would cause this error:
[WARNING] 280/194249 (3268) : Paused frontend GLOBAL.
[WARNING] 280/194249 (3268) : Some proxies refused to pause, performing soft stop now.
[WARNING] 280/194249 (3268) : Proxy GLOBAL stopped (cumulated conns: FE: 0, BE: 0).
[WARNING] 280/194249 (3268) : Proxy sylog-loadb stopped (cumulated conns: FE: 0, BE: 0).
With this change, it now proceeds just like with TCP listeners:
[WARNING] 280/195503 (3885) : Paused frontend GLOBAL.
[WARNING] 280/195503 (3885) : Paused frontend sylog-loadb.
And SIGTTIN also works:
[WARNING] 280/195507 (3885) : Resumed frontend GLOBAL.
[WARNING] 280/195507 (3885) : Resumed frontend sylog-loadb.
On Linux this also works with TCP listeners (which can then be resumed
using listen()) and established TCP sockets (which we currently kill
using setsockopt(so_linger)), both not being portable on other OSes.
UNIX sockets and ABNS sockets do not support it however (connect
always fails). This needs to be further explored to see if other OSes
might benefit from this to perform portable and reliable resets
particularly on the backend side.
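The suspend/resume pair essentially boils down to the following; a
self-contained sketch of the mechanism, not the actual HAProxy functions:

    #include <string.h>
    #include <sys/socket.h>

    /* suspend: connect the socket to itself so it filters everything out */
    static int udp_rx_suspend(int fd)
    {
        struct sockaddr_storage ss;
        socklen_t len = sizeof(ss);

        if (getsockname(fd, (struct sockaddr *)&ss, &len) == -1)
            return -1;
        return connect(fd, (struct sockaddr *)&ss, len);
    }

    /* resume: break the association again, as documented in connect(2) */
    static int udp_rx_resume(int fd)
    {
        struct sockaddr sa;

        memset(&sa, 0, sizeof(sa));
        sa.sa_family = AF_UNSPEC;
        return connect(fd, &sa, sizeof(sa));
    }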
One difficulty in soft-stopping is to make sure not to forget unlisted
listeners. By first doing a pass using protocol_stop_now() we catch the
vast majority of them. The few remaining ones are the ones belonging to
a proxy having a grace period. For these ones, the proxy will arm its
stop_time timer and emit a log message.
Since neither UDP listeners nor peers use the grace period, we can already
get rid of the special cases there since we know they will have been stopped
by the protocols.
This will instantly stop all listeners except those which belong to
a proxy configured with a grace time. This means that UDP listeners
and peers will also be stopped when called this way.
There are multiple ways a proxy may switch to the disabled state,
but now it's essentially once it loses its last listener. Instead
of keeping duplicate code around and reporting the state change
before actually seeing it, we now report it at the moment it's
performed (from the last listener leaving) which allows to remove
the message from all other places.
For now we cannot easily distinguish a peers frontend from another one,
which will be problematic to avoid reporting them when stopping their
listeners. Let's add PR_MODE_PEERS for this. It's not supposed to cause
any issue since all non-HTTP proxies are handled similarly now.
This function will be used to definitely stop a listener (e.g. during a
soft_stop). This is actually tricky because it may be called for a proxy
or for a protocol, both of which require locks and already hold some. The
function takes booleans indicating which ones are already held, hoping
this will be enough. It's not well defined whether proto->disable() and
proto->rx_disable() are supposed to be called with any lock held, and
they are used from do_unbind_listener() with all these locks. Some back
annotations ought to be added on this point.
The proxy's listeners count is updated, and the proxy is marked as
disabled and woken up after the last one is gone. Note that a
listener in listen state is already not attached anymore since it
was disabled.
We have to count unstoppable jobs, which correspond to worker sockpairs, in
order to know when to stop. However the way it's currently done is quite
awkward because these are counted when stopping, making the stop mechanism
non-idempotent. This is definitely something we want to fix before stopping
by protocol or our listeners count will quickly go wrong. Now they are
counted when the listeners are created.
We'll need an already locked variant of this function so let's make
__delete_listener() which will be called with the protocol lock held
and the listener's lock held.
At each place we used to manipulate the FDs directly we can now call
the listener protocol's enable/disable/rx_enable/rx_disable depending
on whether the state changes on the listener or the receiver. One
exception currently remains in listener_accept() which is a bit special
and which should be split into 2 or 3 parts in the various protocol
layers.
The test of fd_updt in do_unbind_listener() that was added by commit
a51885621 ("BUG/MEDIUM: listeners: Don't call fd_stop_recv() if fd_updt
is NULL.") could finally be removed since that part is correctly handled
in the low-level disable() function.
One disable() was added in resume_listener() before switching to LI_FULL
because rx_resume() enables polling on the FD for the receiver while
we want to disable it if the listener is full. There are different
ways to clean this up in the future. One of them could be to consider
that TCP receivers only act at the listener level. But in fact it does
not translate reality. The reality is that only the receiver is paused
and that the listener's state ought not be affected here. Ultimately
the resume_listener() function should be split so that the part
controlled by the protocols only acts on the receiver, and that the
receiver itself notifies the upper listener about the change so that
the listener protocol may decide to disable or enable polling. Conversely
the listener should automatically update its receiver when they share the
same state. Since there is no harm proceeding like this, let's keep this
for now.
These methods will be used to enable/disable accepting new connections
so that listeners do not play with FD directly anymore. Since all the
currently supported protocols work on socket for now, these are identical
to the rx_enable/rx_disable functions. However they were not defined in
sock.c since it's likely that some will quickly start to differ. At the
moment they're not used.
We have to take care of fd_updt before calling fd_{want,stop}_recv()
because it's allocated fairly late in the boot process and some such
functions may be called very early (e.g. to stop a disabled frontend's
listeners).
These methods will be used to enable/disable rx at the receiver level so
that callers don't play with FDs directly anymore. All our protocols use
the generic ones from sock.c at the moment. For now they're not used.
These will be used on receivers, to enable or disable receiving on a
listener, which most of the time just consists in enabling/disabling
the file descriptor.
We have to take care of the existence of fd_updt to know if we may
or not call fd_{want,stop}_recv() since it's not permitted in very
early boot.
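A sketch of what the generic socket implementation looks like under this
constraint (modeled on the description above):

    /* enable/disable rx on a bound receiver; only touch the FD once
     * fd_updt exists, i.e. not in very early boot
     */
    static void sock_rx_enable(struct receiver *rx)
    {
        if (fd_updt)
            fd_want_recv(rx->fd);
    }

    static void sock_rx_disable(struct receiver *rx)
    {
        if (fd_updt)
            fd_stop_recv(rx->fd);
    }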
Instead of calling listen() for IPPROTO_TCP in resume_listener(), let's
call the protocol's ->rx_resume() method when defined, which does the same.
This removes another hard-dependency on the fd and underlying protocol
from the generic functions.
This one undoes ->rx_suspend(), it tries to restore an operational socket.
It was only implemented for TCP since it's the only one we support right
now.
The ->pause method is inappropriate since it doesn't exactly "pause" a
listener but rather temporarily disables it so that it's not visible at
all to let another process take its place. The term "suspend" is more
suitable, since the "pause" is actually what we'll need to apply to the
FULL and LIMITED states which really need to make a pause in the accept
process. And it goes well with the use of the "resume" function that
will also need to be made per-protocol.
Let's rename the function and make it act on the receiver since it's
already what it essentially does, hence the "rx_" prefix to make it
more explicit.
The protocol struct was a bit reordered because it was becoming a real
mess between the parts related to the listeners and those for the
receivers.
Since the listeners were split into receiver+listener, this field ought
to have been renamed because it's confusing. It really links receivers
and not listeners, as most of the time it's used via rx.proto_list!
The nb_listeners field was updated accordingly.
protocol_enable_all() calls proto->enable_all() for all protocols,
which is always equal to enable_all_listeners() which in turn simply is
a generic loop calling enable_listener() always returning ERR_NONE. Let's
clean this madness by first calling enable_listener() directly from
protocol_enable_all().
These ones have never been called, they were referenced by the protocol's
disable_all for some protocols but there are no traces of their use, so
in addition to not being sure the code works, it has never been tested.
Let's remove a bit of complexity starting from there.
fd_stop_recv() has nothing to do in the generic listener code, it's per
protocol as some don't need it. For instance with abns@ it could even
lead to fd_stop_recv(-1). And later with QUIC we don't want to touch
the fd at all! Commit f2cb169487, which delegated fd manipulation to
their respective threads, made it impossible to call it down there, but
that's not the case anymore, so let's perform the action in the
protocol-specific code.
By using the same "ret" variable in the "if" block to test the return
value of pause(), the second one shadows the first one and when forcing
the result to zero in case of an error, it doesn't do anything. The
problem is that some listeners used to fail to pause in multi-process
mode and this was not reported, but their failure was automatically
resolved by the last process to pause. By properly checking for errors
we might now possibly report a race once in a while so we may have to
roll this back later if some users meet it.
The test on ==0 is wrong too since technically speaking a total stop
validates the need for a pause, but stops the listener so it's just
the resume that won't work anymore. We could switch to stopped but
it's an involuntary switch and the user will not know. Better then
mark it as paused and let the resume continue to fail so that only
the resume will eventually report an error (e.g. abns@).
This must not be backported as there is a risk of side effect by fixing
this bug, given that it hides other bugs itself.
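For reference, the shadowing pattern at hand looked roughly like this
(simplified, with illustrative names):

    int ret = 1;

    if (l->state >= LI_PAUSED) {
        int ret = suspend_receiver(l);  /* shadows the outer 'ret'!     */
        if (ret < 0)
            ret = 0;   /* only the inner copy is zeroed; the caller    */
    }                  /* still sees the outer value and misses errors */
    return ret;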
In multi-process, the TCP pause is very brittle and we never noticed
it because the error was lost in the upper layers. The problem is that
shutdown() may fail if another process already did it, and will cause
a process to fail to pause.
What we do here in case of error is that we double-check the socket's
state to verify if it's still accepting connections, and if not, we
can conclude that another process already did the job in parallel.
The difficulty here is that we're trying to eliminate false positives
where some OSes will silently report a success on shutdown() while they
don't shut the socket down, hence this dance of shutw/listen/shutr that
only keeps the compatible ones. Probably that a new approach relying on
connect(AF_UNSPEC) would provide better results.
When temporarily pausing the listeners with SIGTTOU, we now pause
all listeners via the protocols instead of the proxies. This has the
benefit that listeners are paused regardless of whether or not they
belong to a visible proxy. And for resuming via SIGTTIN we do the
same, which allows to report binding conflicts and address them,
since the operation can be repeated on a per-listener basis instead
of a per-proxy basis.
While in appearance all cases were properly handled, it's impossible
to completely rule out the possibility that something broken used to
work by luck due to the scan ordering which is naturally different,
hence the major tag.
These two functions are used to pause and resume all listeners of
all protocols. They use the standard listener functions for this
so they're supposed to handle the situation gracefully regardless
of the upper proxies' states, and they will report completion on
proxies once the switch is performed.
It might be nice to define a particular "failed" state for listeners
that cannot resume and to count them on proxies in order to mention
that they're definitely stuck. On the other hand, the current
situation is retryable which is quite appreciable as well.
Till now, we used to call pause_proxy()/resume_proxy() to enable/disable
processing on a proxy, which is used during soft reloads. But since we want
to drive this process from the listeners themselves, we have to instead
proceed the other way around so that when we enable/disable a listener,
it checks if it changed anything for the proxy and notifies about updates
at this level.
The detection is made using li_ready=0 for pause(), and li_paused=0
for resume(). Note that we must not include any test for li_bound because
this state is seen by processes which share the listener with another one
and which must not act on it since the other process will do it. As such
the socket behind the FD will automatically be paused and resumed without
its local state changing, but this is the limit of a multi-process system
with shared listeners.
It's quite confusing to have the test on LI_READY very low in the function
as it should be made much earlier. Just like with previous commit, let's
do it when entering. The additional states, however (limited, full) continue
to go through the whole function.
It's better not to try to perform pause() actions on wrong states, so
let's check this and make sure that all callers are now safe. This
means that we must not try to pause a listener which is already paused
(e.g. it could possibly fail if the pause operation isn't idempotent at
the socket level), nor should we try it on earlier states.
The two functions don't need to be distinguished anymore since they have
all the necessary info to act as needed on their listeners. Let's just
pass via stop_proxy() and make it check for each listener which one to
close or not.
Its sole remaining purpose was to display "proxy foo started", which
has little benefit and pollutes output for those with plenty of proxies.
Let's remove it now.
The VTCs were updated to reflect this, because many of them had explicit
counts of dropped lines to match this message.
This is tagged as MEDIUM because some users may be surprised by the
loss of this quite old message.
The remaining proxy states were only used to distinguish an enabled
proxy from a disabled one. Due to the initialization order, both
PR_STNEW and PR_STREADY were equivalent after startup, and they
would only differ from PR_STSTOPPED when the proxy is disabled or
shutdown (which is effectively another way to disable it).
Now we just have a "disabled" field which allows to distinguish them.
It's becoming obvious that start_proxies() is only used to print a
greeting message now, that we'd rather get rid of. Probably that
zombify_proxy() and stop_proxy() should be merged once their
differences move to the right place.
The enabled/disabled config options were stored into a "state" field
that is an integer but contained only PR_STNEW or PR_STSTOPPED, which
is a bit confusing, and causes a dependency with proxies. This was
renamed to "disabled" and is used as a boolean. The field was also
moved to the end of the struct to stop creating a hole and fill another
one.
Instead of looking at listeners in proxies in PR_STNEW state, we'd
rather check for listeners in those not in PR_STSTOPPED as it's only
this state which indicates the proxy was disabled. And let's check
the listeners count instead of testing the list's head.
This state was used to mention that a proxy was in PAUSED state, as opposed
to the READY state. This was causing some trouble because if a listener
failed to resume (e.g. because its port was temporarily in use during the
resume), it was not possible to retry the operation later. Now by checking
the number of READY or PAUSED listeners instead, we can accurately know if
something went bad and try to fix it again later. The case of the temporary
port conflict during resume now works well:
$ socat readline /tmp/sock1
prompt
> disable frontend testme3
> disable frontend testme3
All sockets are already disabled.
> enable frontend testme3
Failed to resume frontend, check logs for precise cause (port conflict?).
> enable frontend testme3
> enable frontend testme3
All sockets are already enabled.
This state is only set when a pause() fails but isn't even set when a
resume() fails. And we cannot recover from this state. Instead, let's
just count remaining ready listeners to decide to emit an error or not.
It's more accurate and will better support new attempts if needed.
Since v1.4 or so, it's almost not possible anymore to set this state. The
only exception is by using the CLI to change a frontend's maxconn setting
below its current usage. This case makes no sense, and for other cases it
doesn't make sense either because "full" is a vague concept when only
certain listeners are full and not all. Let's just remove this unused
state and make it clear that it's not reported. The "ready" or "open"
states will continue to be reported without being misleading as they
will be opposed to "stop".
The proxy state tries to be synthetic but that doesn't work well with
many listeners, especially for transition phases or after a failed
pause/resume.
In order to address this, we'll instead rely on counters of listeners in
a given state for the 3 major states (ready, paused, listen) and a total
counter. We'll now be able to determine a proxy's state by comparing these
counters only.
This function is used as a wrapper to set a listener's state everywhere.
We'll use it later to maintain some counters in a consistent state when
switching state so it's capital that all state changes go through it.
No functional change was made beyond calling the wrapper.
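Together with the per-state counters described above, the wrapper ends up
looking roughly like this (field names follow the description; the exact
state mapping is an assumption):

    void listener_set_state(struct listener *l, enum li_state st)
    {
        struct proxy *px = l->bind_conf ? l->bind_conf->frontend : NULL;

        if (px) {
            /* leave the current state */
            switch (l->state) {
            case LI_READY:
            case LI_FULL:
            case LI_LIMITED: px->li_ready--;  break;
            case LI_PAUSED:  px->li_paused--; break;
            case LI_LISTEN:  px->li_bound--;  break;
            default: break;
            }
            /* enter the new one */
            switch (st) {
            case LI_READY:
            case LI_FULL:
            case LI_LIMITED: px->li_ready++;  break;
            case LI_PAUSED:  px->li_paused++; break;
            case LI_LISTEN:  px->li_bound++;  break;
            default: break;
            }
        }
        l->state = st;
    }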
This thing was needed for an optimization used in soft_stop() which
doesn't exist anymore, so let's remove it as it's cryptic and hinders
the listeners cleanup.
The loop doesn't match anymore since the non-started listeners are in
LI_INIT and even if it had ever worked the benefit of closing zombies
at this point looks void at best.
The zombie state is not used anymore by the listeners, because in the
last two cases where it was tested it couldn't match as it was covered
by the test on the process mask. Instead now the FD is either in the
LISTEN state or the INIT state. This also avoids forcing the listener
to be single-dimensional because actually belonging to another process
isn't totally exclusive with the other states, which explains some of
the difficulties requiring to check the proc_mask and the fd sometimes.
So let's get rid of it now not to be tempted to reuse it.
The doc on the listeners state was updated.
Because of the zombie state, proxies have a skewed vision of the state
of listeners, which explains why there are hacks switching the state
from ZOMBIE to INIT in the proxy cleaning loop. This is particularly
complicated and not needed, as all the information is now available
in the protocol list and the fdtab.
What we do here instead is to first close all active listeners or
receivers by protocol and clean their protocol parts. Then we scan the
fdtab to get rid of remaining ones that were necessarily in INIT state
after a previous invocation of delete_listener(). From this point, we
know the listeners are cleaned, they can safely be freed by scanning the
proxies.
The ZOMBIE state on listener is a real mess. Listeners passing through
this state have lost their consistency with the proxy AND with the fdtab.
Plus this state is not used for all foreign listeners, only for those
belonging to a proxy that entirely runs on another process, otherwise it
stays in INIT state, which makes the usefulness extremely questionable.
But the real issue is that it's impossible to untangle the receivers
from the proxy state as long as we have this because of deinit()...
So what we do here is to start by making unbind_listener() support being
called more than once. This will permit calling it again to really close
the FD and finish the operations if it's called with an FD that's in a
fake state (such as INIT but with a valid fd).
During the startup process we don't have any fdtab nor fd_updt for quite
a long time, and as such some operations on the listeners are not
permitted, such as fd_want_*/fd_stop_* or fd_delete(). The latter is of
particular concern because it's used when stopping a disabled frontend,
and it's performed very early during check_config_validity() while there
is no fdtab yet. The trick till now relies on the listener's state which
is a bit brittle.
There is absolutely no valid reason for stopping a proxy's listeners this
early, we can postpone it after init_pollers() which will at least have
allocated fdtab.
During 2.1 development, commit f2cb16948 ("BUG/MAJOR: listener: fix
thread safety in resume_listener()") was introduced to bounce the
enabling/disabling of a listener's FD to one of its threads because
the remains of fd_update_cache() were fundamentally incompatible with
the need to call fd_want_recv() or fd_stop_recv() for another thread.
However since then we've totally dropped such code and it's totally
safe to use these functions on an FD that is solely used by another
thread (this is even used by the FD migration code). The only remaining
limitation concerning the wake up delay was addressed by previous commit
"MEDIUM: fd: always wake up one thread when enabling a foreing FD".
The current situation forces the FD management to remain in the
pause_listener() and resume_listener() functions just so that it can
bounce between threads, without having the ability to delegate it to
the suitable protocol layer.
So let's first remove this now unneeded workaround.
Since 2.2 it's safe to enable/disable another thread's FD but the fd_wake
calls will not immediately be considered because nothing wakes the other
threads up. This will have an impact on listeners when deciding to resume
them after they were paused, so at minima we want to wake up one of their
threads, just like the scheduler does on task_kill(). This is what this
patch does.
204 and 304 HTTP responses must not contain a message body. These status
codes are correctly handled when the responses are received from a server.
But there is no specific processing for internal HTTP responses (errorfiles
and http replies).
Now, when errorfiles or http replies are parsed during configuration
parsing, an error is triggered if a 204/304 message contains a body. An
extra check is also performed to ensure the body length matches the
announced content-length.
This patch should fix the issue #891. It must be backported as far as 2.0. For
2.1 and 2.0, only the http_str_to_htx() function must be fixed; the
http_parse_http_reply() function does not exist there.
96 bytes are announced in the C-L header for a message with a body of 97
bytes. This bug was introduced by the patch 46a030cdd ("CLEANUP: assorted
typo fixes in the code and comments").
This patch must be backported to all versions containing the patch above
(only 2.2 for now).
This patch is similar to the previous one for the FCGI mux. The same is
true for H2, but the bug is far harder to trigger because of how the
protocol exchanges are sequenced. Still, it may explain strange aborts in
some edge cases.
A read0 received on the connection must not be handled too early by H2 streams.
If the demux buffer is not empty, the pending read0 must not be considered. The
H2 streams must not be switched to the half-closed remote state in
h2s_wake_one_stream() and the CS_FL_EOS flag must not be set on the
associated conn-stream in h2_rcv_buf(). To sum up, it means that as long
as data remain pending in the demux buffer, no abort must be reported to
the streams.
To fix the issue, a dedicated function has been added, responsible for detecting
pending read0 for a H2 connection. A read0 is reported only if the demux buffer
is empty. This function is used instead of conn_xprt_read0_pending() at some
places.
Note that the HREM stream state should not be used to report aborts. It is
done in the h2s_wake_one_stream() function and is a legacy of the very first
versions of the mux-h2.
This patch should be backported as far as 2.0. In 1.8, the code is too
different to apply it like that. But it is probably useless because the mux-h2
can only be installed on the client side.
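The dedicated function essentially combines both conditions; a sketch
(field names follow the mux-h2 internals, shown as an assumption):

    /* reports a pending read0 only once the demux buffer is empty */
    static int h2c_read0_pending(struct h2c *h2c)
    {
        return conn_xprt_read0_pending(h2c->conn) && !b_data(&h2c->dbuf);
    }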
A read0 received on the connection must not be handled too early by FCGI
streams. If the demux buffer is not empty, the pending read0 must not be
considered. The FCGI streams must not be switched to the half-closed remote
state in fcgi_strm_wake_one_stream() and the CS_FL_EOS flag must not be set
on the associated conn-stream in fcgi_rcv_buf(). To sum up, it means that
as long as data remain pending in the demux buffer, no abort must be
reported to the streams.
To fix the issue, a dedicated function has been added, responsible for detecting
pending read0 for a FCGI connection. A read0 is reported only if the demux
buffer is empty. This function is used instead of conn_xprt_read0_pending() at
some places.
This patch should fix the issue #886. It must be backported as far as 2.1.
This patch re-introduces the "bind" statement in log forward
sections to handle syslog TCP listeners as defined in
RFC 6587.
As a complement, it introduces the "maxconn", "backlog" and "timeout
client" statements to configure those listeners.
Old processes didn't die if a log forward section was declared and
a soft stop was requested.
This patch fixes this issue and should be backported to branches including
the log forward feature.
Coverity reported dead code in sock_unix_bind_receiver() function. A goto clause
is unreachable because of the preceding if/else block.
This patch should fix the issue #865. No backport needed.
There is no reason to wake up the H1 connection when a new output buffer is
retrieved after an allocation failure because only the H1 stream will fill it.
The session is always defined for a frontend connection. When a new client
connection is established, the session is set for the first H1 stream. But on
keep-alive connections, it is not set for the following H1 streams even
though it is possible.
This patch is tagged as a bug because it fixes an inconsistency in the H1
streams creation. But it does not fix a known bug.
This patch must be backported as far as 2.0.
The condition to set the CO_RFL_READ_ONCE flag is not really accurate. We
must check the request state on a frontend connection only and, conversely,
the response state on a backend connection only. Only the parsed side must
be considered, not
the opposite one.
This patch must be backported to 2.2.
On deinit, when the server SSL ctx is released, we must take care to release the
cached SSL sessions stored in the array <ssl_ctx.reused_sess>. There are
global.nbthread entries in this array, each of which may hold a pointer to
a cached session.
This patch should fix the issue #802. No backport needed.
During the config check, the post parsing is not performed. Thus, cache filters
are not fully initialized and their cache names are never released. To be able to
release them, a flag is now set when a cache filter is fully initialized. On
deinit, if the flag is not set, it means the cache name must be freed.
The patch should fix issue #849. No backport needed.
[Cf: Tim is the patch author, but I added the commit message]
When a TCP listener is bound, in the tcp_bind_listener() function, a warning
message may be reported and should be displayed in verbose mode. But the warning
message is actually lost if the socket is successfully bound because we don't
fill the <errmsg> variable in this case.
This patch should fix the issue #863. No backport is needed.
A peer connection status must be considered as valid only if there is an applet
which has been instantiated for the connection to the peer. So, ->statuscode
should be considered as the last known peer connection status from the last
connection to this peer if any. To reflect this, "statuscode" field of peer dump
is renamed to "last_statuscode".
This patch also adds an "active"/"inactive" field after the peer location type
("remote" or "local") if an applet has been instantiated for this peer connection
or not.
Thank you to Emeric for having noticed this issue.
Must be backported as far as 1.9.
Remove a variable declaration inside a for-loop. This was introduced by my
patch series implementing dynamic stats. This is not supported by older gcc
versions, notably in the FreeBSD environment of the CI.
Use the new stats module API to integrate the DNS counters in the
standard stats. This is done in order to avoid code duplication, keep
the CLI-related code out of dns, and use the full possibilities of the
stats functions, allowing DNS stats to be printed in CSV or JSON format.
Integrate the additional proxy stats on the html stats page. For each
module, a new column is displayed with the individual stats available as
a tooltip.
Add a boolean 'clearable' to the stats module structure. If set, it forces
all the counters to be reset on 'clear counters' cli command. If not,
the counters are reset only when 'clear counters all' is used.
This is executed on startup with the registered statistics modules. The
existing statistics have been merged into a list containing all
statistics for each domain. This is useful to print all available
statistics in a generic way.
Allocate extra counters for all proxies/servers/listeners instances.
These counters are allocated with the counters from the stats modules
registered on startup.
A stats module can be registered to quickly add new statistics to
haproxy. It must be attached to one of the available stats domains. The
registration must be done using INITCALL at the STG_REGISTER stage.
The stat module has a name which should be unique for each new module in
a domain. It also contains a statistics list with their name/desc and a
pointer to a function used to fill the stats from the module counters.
The module also provides the initial counter values used for the
automatically allocated counters. The offsets of these counters
are stored in the module structure.
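As an illustration, registering such a module could look like the sketch
below; the exact field names of struct stats_module are assumed here, not
copied from the tree:

  #include <haproxy/initcall.h>
  #include <haproxy/stats.h>

  static struct name_desc foo_stats[] = {
          { .name = "foo_total", .desc = "Total foo events processed" },
  };

  static struct foo_counters {
          long long foo_total;
  } foo_counters;  /* initial values of the auto-allocated counters */

  static void foo_fill_stats(void *counters, struct field *stats)
  {
          struct foo_counters *c = counters;

          stats[0] = mkf_u64(FN_COUNTER, c->foo_total);
  }

  static struct stats_module foo_stats_module = {
          .name          = "foo",              /* unique within the domain */
          .domain_flags  = STATS_DOMAIN_PROXY, /* assumed flag name */
          .fill_stats    = foo_fill_stats,
          .stats         = foo_stats,
          .stats_count   = 1,
          .counters      = &foo_counters,
          .counters_size = sizeof(foo_counters),
          .clearable     = 1,
  };

  INITCALL1(STG_REGISTER, stats_register_module, &foo_stats_module);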
Use the character '-' to mark the end of static statistics on proxy
domain. After this marker, the order of the fields is not guaranteed and
should be parsed with care.
This flag can be used to determine on what type of proxy object the
statistics should be relevant. It will be useful when adding dynamic
statistics. Currently, this flag is not used.
The domain option will be used to have statistics attached to other
objects than proxies/listeners/servers. At the moment, only the PROXY
domain is available.
Add an argument 'domain' on the 'show stats' cli command to specify the
domain. Only 'domain proxy' is available now. If not specified, proxy
will be considered the default domain.
For HTML output, only proxy statistics will be displayed.
Debug Messages emitted in lua using core.Debug() or core.log() are now only
displayed on stderr if HAProxy is started in debug mode (-d parameter on the
command line). There is no change for other message levels.
This patch should fix the issue #879. It may be backported to all stable
versions.
Create a dedicated function to loop on proxies and dump them. This will
be clearer when other objects are dumped as well.
This patch is needed to extend stat support to components other than
proxy objects.
Create a dedicated function to dump a proxy as JSON content. This
patch will be needed when other types of objects become available for
JSON dump.
This patch is needed to extend stat support to components other than
proxy objects.
Use an opaque pointer to store the proxy instance. Regroup server/listener
as a single opaque pointer. This has the benefit of making the structure
easier to evolve in order to support statistics on other types of objects
in the future.
This patch is needed to extend stat support to components other than
proxy objects.
The prometheus module has been adapted for these changes.
Make the stats size parametric in the CSV/JSON dump functions. This is
needed for the future patch which provides dynamic stats. For now the
static value ST_F_TOTAL_FIELDS is provided.
Remove the unused parameter px of stats_dump_one_line().
This patch is needed to extend stat support to components other than
proxy objects.
Un-mark stats_dump_one_line() and stats_putchk() as static and export them
in the header file. These functions will be reusable by other components to
print their statistics.
This patch is needed to extend stat support to components other than
proxy objects.
There is a confusion between the HAProxy bundle and OpenSSL. OpenSSL
does not have "bundles" but multiple certificates in the same store.
Fix a comment in the crt-list code.
Since the health-check refactoring in 2.2, checks through a socks4 proxy
are broken. To fix this bug, the CO_FL_SOCKS4 flag must be set on the
connection before calling the connect() callback function, because this flag
is checked to use the right destination address. The same is done for the
CO_FL_SEND_PROXY flag for consistency.
A reg-test has been added to test the "check-via-socks4" directive.
This patch must be backported to 2.2.
The warning is only emitted for HTTP frontends. The idea is to encourage the
use of "tcp-request session" rules to track counters that do not depend on
the request content. The documentation has been updated accordingly.
The warning is important because since the multiplexers were added in the
processing chain, the HTTP parsing is performed at a lower level. Thus parsing
errors are detected in the multiplexers, before the stream creation. In HTTP/2,
the error is reported by the multiplexer itself and the stream is never
created. This difference has a certain number of consequences, one of which is
that HTTP request counting in stick tables only works for valid H2 requests,
and HTTP error tracking in stick tables never considers invalid H2 requests
but only invalid H1 ones. And the aim is to do the same with the mux-h1. This
change will not be done for 2.3, but for 2.4. In the end, H1 and H2 parsing
errors will
be caught by the multiplexers, at the session level. Thus, tracking counters at
the content level should be reserved for rules using a key based on the request
content or those using ACLs based on the request content.
To be clear, a warning will be emitted for the following rules :
tcp-request content track-sc0 src
tcp-request content track-sc0 src if ! { src 10.0.0.0/24 }
tcp-request content track-sc0 src if { ssl_fc }
But not for the following ones :
tcp-request content track-sc0 req.hdr(host)
tcp-request content track-sc0 src if { req.hdr(host) -m found }
We use chunk_initstr() to store the program name as the default log-tag.
If we use the log-tag directive in the config file, this chunk will be
destroyed and replaced. chunk_initstr() sets the chunk size to 0 so we
will free the chunk itself, but not its content.
This happens for a global section and also for a proxy.
We fix this by using chunk_initlen() instead of chunk_initstr().
We also check that the memory allocation was successful, otherwise we quit.
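For illustration, the fix boils down to something like this sketch (the call
sites are condensed and the variable names are illustrative):

  /* before: chunk_initstr() sets size to 0, so the content is treated as
   * a constant string and is never freed when log-tag replaces it
   */
  chunk_initstr(&global.log_tag, progname);

  /* after: give the chunk an allocated copy and a non-zero size, so the
   * content may legitimately be freed later
   */
  char *tag = strdup(progname);
  if (!tag)
          exit(EXIT_FAILURE);
  chunk_initlen(&global.log_tag, tag, strlen(tag) + 1, strlen(tag));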
This fixes github issue #850.
It can be backported as far as 1.9, with minor adjustments to includes.
This condition is never true as we either break or goto error, so those
two lines can be removed in the current state of the code.
This fixes GitHub issue #862.
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Similar to warning during the parsing of the regular configuration file
that was added in 2fd5bdb439 this patch adds
a warning to the parsing of a crt-list if the file does not end in a
newline (and thus might have been truncated).
The logic was essentially just copied over. It might be good to refactor
this in the future, allowing easy re-use within all line-based config
parsers.
see https://github.com/haproxy/haproxy/issues/860#issuecomment-693422936
see 0354b658f0
This should be backported as a warning to 2.2.
Previous commit fa41cb679 ("MINOR: tools: support for word expansion
of environment in parse_line") introduced two new isspace() on a char
and broke the build on systems using an array disguised in a macro
instead of a function (like cygwin). Just use the usual cast.
Allow the syntax "${...[*]}" to expand an environment variable
containing several values separated by spaces as individual arguments. A
new flag PARSE_OPT_WORD_EXPAND has been added to toggle this feature on
parse_line() invocation. In case of an invalid syntax, a new error
PARSE_ERR_WRONG_EXPAND will be triggered.
This feature was requested in GitHub issue #165.
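As a hypothetical illustration (the LISTEN variable name is made up), with
LISTEN="127.0.0.1:8080 127.0.0.1:8443" exported in the environment, a line
such as:

  bind "${LISTEN[*]}"

would be parsed as two distinct address arguments, exactly as if both
addresses had been written on the bind line.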
For some algos (roundrobin, static-rr, leastconn, first) we know that
if there is any request queued in the backend, it's because a previous
attempt failed at finding a suitable server after trying all of them.
This alone is sufficient to decide that the next request will skip the
LB algo and directly reach the backend's queue. Doing this alone avoids
an O(N) lookup when load-balancing on a saturated farm of N servers,
which starts to be very expensive for hundreds of servers, especially
under the lbprm lock. This change alone has increased the request rate
from 110k to 148k RPS for 200 saturated servers on 8 threads, and
fwlc_reposition_srv() doesn't show up anymore in perf top. See github
issue #880 for more context.
It could have been the same for random, except that random is performed
using a consistent hash and it only considers a small set of servers (2
by default), so it may result in queueing at the backend despite having
some free slots on unknown servers. It's no big deal though since random()
only performs two attempts by default.
For hashing algorithms this is pointless since we don't queue at the
backend, except when there's no hash key found, which is the least of
our concerns here.
If random() returns a server whose maxconn is reached or the queue is
used, instead of adding the request to the server's queue, it is better to
add it to the backend queue so that it can be served by any server (hence
the fastest one).
We must not exit on error out of the crtlist_parse_line() function.
The cfgerr error must be checked with the ERR_CODE mask.
Must be backported to 2.2.
If the TRACE option is used when compiling the haproxy source,
the following error occurs on debian 9.13:
src/calltrace.o: In function `make_line':
.../src/calltrace.c:204: undefined reference to `rdtsc'
src/calltrace.o: In function `calltrace':
.../src/calltrace.c:277: undefined reference to `rdtsc'
collect2: error: ld returned 1 exit status
Makefile:866: recipe for target 'haproxy' failed
The code dealing with zombie proxies in soft_stop() is bogus, it uses
close() instead of fd_delete(), leaving a live entry in the fdtab with
a dangling pointer to a free memory location. The FD might be reassigned
for an outgoing connection for the time it takes the proxy to completely
stop, or could be dumped on the CLI's "show fd" command. In addition,
the listener's FD was not even reset, leaving doubts about whether or
not it will happen again in deinit().
And in deinit(), the loop in charge of closing zombie FDs is particularly
unsafe because it closes the fd then calls unbind_listener() then
delete_listener() hoping none of them will touch it again. Since it
requires some mental efforts to figure what's done there, let's correctly
reset the fd here as well and close it using fd_delete() to eliminate any
remaining doubts.
It's uncertain whether this should be backported. Zombie proxies are rare
and the situations capable of triggering such issues are not trivial to
setup. However it's easy to imagine how things could go wrong if backported
too far. Better wait for any matching report if at all (this code has been
there since 1.8 without anybody noticing).
There's a nasty case with listeners that belong to foreign processes.
If a proxy is defined this way:
global
nbproc 2
frontend f
bind :1111 process 1
bind :2222 process 2
and if "stats expose-fd listeners" is set, the listeners' FDs will not
be closed on the processes that don't use them. At this point it's not
a big deal, except that they're shared between processes and that a
"disable frontend f" issued on one process will pause all of them and
cause the other process to see accept() fail, turning its own listener
to state LI_LIMITED to try to leave it some time to recover. But it
will never recover, even after an enable.
The root cause of the issue is that the ZOMBIE state doesn't cover
this situation since it's only for a proxy being entirely bound to a
process.
What we do here to address this is that we refrain from pausing a
file descriptor that belongs to a foreign process in pause_listener().
This definitely solves the problem. A similar test is present in
resume_listener() and is the reason why the FD doesn't recover upon the
"enable" action by the way.
This ought to be backported to 1.8 where seamless reload was integrated.
The config above should be sufficient to validate that the fix works;
after a pair of "disable/enable frontend" no process will handle the
traffic to one of the ports anymore.
Since we've fixed the way URIs are handled in 2.1, some users have started
to experience inconsistencies in "balance uri" between requests received
over H1 and the same ones received over H2. This is caused by the fact
that H1 rarely uses absolute URIs while H2 always uses them. Similar
issues were reported already around replace-uri etc, leading to "pathq"
recently being introduced, so this isn't new.
Here what this patch does is add a new option to "balance uri" to indicate
that the hashing should only start at the path and not cover the authority.
This makes H1 relative URIs and H2 absolute URI hashes equal again.
Some extra options could be added to normalize URIs by always hashing the
authority (or host) in front of them, which would make sure that both
absolute and relative requests provide the same hash. This is left for
later if needed.
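A minimal configuration sketch (the option name "path-only" is the one
retained in the documentation):

  backend app
      balance uri path-only
      server s1 192.168.0.11:8080
      server s2 192.168.0.12:8080

With this, a relative H1 request for "/img/logo.png" and an absolute H2
request for "https://example.org/img/logo.png" hash identically.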
This memory leak happens if there are two or more defaults sections. When
the default proxy is reinitialized, the structure member containing the
config filename must be freed.
This fixes GitHub issue #851.
Should be backported as far as 1.6.
When memory allocation fails in cfg_parse_peers or when an error occurs
while parsing a stick-table, the temporary table and its id must be freed.
This fixes github issue #854. It should be backported as far as 2.0.
A subtle bug was introduced by the commit a6d9879e6 ("BUG/MEDIUM: htx:
smp_prefetch_htx() must always validate the direction"), for the "method"
sample fetch only. The sample data type and the method id are always
overwritten because smp_prefetch_htx() function is called later in the
sample fetch evaluation. The bug is in the smp_prefetch_htx() function but
it is only visible for the "method" sample fetch, for an unknown method.
In fact, when smp_prefetch_htx() is called, the sample object is
altered. The data type is set to SMP_T_BOOL and, on success, the data value
is set to 1. Thus, if the caller has already set some info into the sample
object, it may be lost. AFAIK, there is no reason to do so. It is
inherited from the legacy HTTP code and I honestly don't know why it was
done this way. So, instead of fixing the "method" sample fetch to set useful
info after the call to smp_prefetch_htx() function, I prefer to not alter
the sample object in smp_prefetch_htx().
This patch must be backported as far as 2.0. On the 2.0, only the HTX part
must be fixed.
When sending a frame ACK, the parser state is not equal to H2_CS_FRAME_H
and we used to report it as an error, which is not true. In fact we should
only indicate when we skip remaining data.
This may be backported as far as 2.1.
I was careful to have it for sock_unix.c but missed it for sock_inet
which broke with commit 36722d227 ("MINOR: sock_inet: report the errno
string in binding errors") depending on the build options. No backport
is needed.
Just like with the previous patch, let's report UNIX socket binding errors
in plain text. We can now see for example:
[ALERT] 260/083531 (13365) : Starting frontend f: cannot switch final and temporary UNIX sockets (Operation not permitted) [/tmp/root.sock]
[ALERT] 260/083640 (13375) : Starting frontend f: cannot change UNIX socket ownership (Operation not permitted) [/tmp/root.sock]
With the socket binding code cleanup it becomes easy to add more info to
error messages. One missing thing used to be the error string, which is
now added after the generic one, for example:
[ALERT] 260/082852 (12974) : Starting frontend f: cannot bind socket (Permission denied) [0.0.0.0:4]
[ALERT] 260/083053 (13292) : Starting frontend f: cannot bind socket (Address already in use) [0.0.0.0:4444]
[ALERT] 260/083104 (13298) : Starting frontend f: cannot bind socket (Cannot assign requested address) [1.1.1.1:4444]
We used to resort to a trick to detect whether the caller was a listener
or an outgoing socket in order never to present an AF_CUST_UDP* socket
to a log server nor a nameserver. This is no longer necessary, the socket
type alone will be enough.
We don't need to cheat with the sock_domain anymore, we now always have
the SOCK_DGRAM sock_type as a complementary selector. This patch restores
the sock_domain to AF_INET* in the udp* protocols and removes all traces
of the now unused AF_CUST_*.
By doing so we can remove the hard-coded mapping from AF_INET to AF_CUST_UDP
but we still need to keep the test on the listeners as long as these dummy
families remain present in the code.
The protocol array used to be only indexed by socket family, which is very
problematic with UDP (requiring an extra family) and with the forthcoming
QUIC (also requiring an extra family), especially since that binds them to
certain families, prevents them from supporting dgram UNIX sockets etc.
In order to address this, we now start to register the protocols with more
info, namely the socket type and the control type (either stream or dgram).
This is sufficient for the protocols we have to deal with, but could also
be extended further if multiple protocol variants were needed. But as is,
it still fits nicely in an array, which is convenient for lookups that are
instant.
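The lookup can be sketched like this (the array layout and function names
are illustrative, not the exact ones from protocol.c):

  /* sketch: one slot per family, socket type and control type */
  static struct protocol *proto_tab[AF_MAX][2][2];

  void protocol_register(struct protocol *proto)
  {
          proto_tab[proto->sock_domain]
                   [proto->sock_type == SOCK_DGRAM]
                   [proto->ctrl_type == SOCK_DGRAM] = proto;
  }

  /* instant lookup; QUIC would be (family, dgram socket, stream control) */
  struct protocol *protocol_lookup(int family, int sock_dgram, int ctrl_dgram)
  {
          return proto_tab[family][!!sock_dgram][!!ctrl_dgram];
  }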
This one will be needed to more accurately select a protocol. It may
differ from the socket type for QUIC, which uses dgram at the socket
layer and provides stream at the control layer. The upper level requests
a control layer only so we need this field.
Most callers of str2sa_range() need the protocol only to check that it
provides a ->connect() method. It used to be used to verify that it's a
stream protocol, but it might be a bit early to get rid of it. Let's keep
the test for now but move it to str2sa_range() when the new flag PA_O_CONNECT
is present. This way almost all call places could be cleaned from this.
There's a strange test in the server address parsing code that rechecks
the family from the socket which seems to be a duplicate of the previously
removed tests. It will have to be rechecked.
We'll need this so that it can return pointers to stacked protocol in
the future (for QUIC). In addition this removes a lot of tests for
protocol validity in the callers.
Some of them were checked further apart, or after a call to
str2listener() and they were simplified as well.
There's still a trick: we can fail to return a protocol in case the caller
accepts an fqdn for use later. This is what servers do and in this case it
is valid to return no protocol. A typical example is:
server foo localhost:1111
The function will need to use more than just a family, let's pass it
the selected protocol. The caller will then be able to do all the fancy
stuff required to pick the best protocol.
str2listener() was temporarily hacked to support datagram sockets for
the log-forward listeners. This hack had an undesirable side effect:
"bind udp@1.2.3.4:5555" was silently accepted as TCP for a bind line.
We don't need this hack anymore since the only user (log-forward) now
relies on str2receiver(). Now such an address will properly be rejected.
Thanks to this we don't need to specify "udp@" as it's implicitly a
datagram type listener that is expected, so any AF_INET/AF_INET6 address
will work.
This is at least temporary, as migrating everything at once is way too
difficult. For now it still creates listeners but only allows DGRAM sockets.
This aims at easing the split between listeners and receivers.
Now we only rely on dgram type associated with AF_INET/AF_INET6 to infer
UDP4/UDP6. We still keep the hint based on PA_O_SOCKET_FD to detect that
the caller is a listener though. It's still far from optimal but UDP
remains rooted into the protocols and needs to be taken out first.
For now only listeners can make use of AF_CUST_UDP and it requires hacks
in the DNS and logsrv code to remap it to AF_INET. Make str2sa_range()
smarter by detecting that it's called for a listener and only set these
protocol families for listeners. This way we can get rid of the hacks.
The parser now supports a socket type for the control layer and a possible
other one for the transport layer. Usually they are the same except for
protocols like QUIC which will provide a stream transport layer based on
a datagram control layer. The default types are preset based on the caller's
expectations, and may be refined using "stream+" and "dgram+" prefixes.
For now they were not added to the documentation because other changes
will probably happen around UDP as well. It is conceivable that "tcpv4@"
or "udpv6@" will appear later as aliases for "stream+ipv4" or "dgram+ipv6".
Just like for inherited sockets, we want to make sure that FDs that are
mentioned in "sockpair@" are actually usable. Right now this test is
performed by the callers, but not everywhere. Typically, the following
config will fail if fd #5 is not bound:
frontend
bind sockpair@5
But this one will pass if fd #6 is not bound:
backend
server s1 sockpair@6
Now both will return an error in such a case:
- 'bind' : cannot use file descriptor '5' : Bad file descriptor.
- 'server s1' : cannot use file descriptor '6' : Bad file descriptor.
As such the test in str2listener() is not needed anymore (and it was
wrong by the way, as it used to test for the socket by overwriting the
local address with a new address that's made of the FD encoded on 16
bits and happens to still be at the same place, but that strictly
depends on whatever the kernel wants to put there).
Since the previous patch we know that a successfully bound fd@XXX socket
is returned as its own protocol family from str2sa_range() and not as
AF_CUST_EXISTING_FD anymore, so we don't need to check for that case
in str2listener().
When str2sa_range() is invoked for a bind or log line, and it gets a file
descriptor number, it will immediately resolve the socket's address (when
it's a socket) so that the address family, address and port are correctly
set. This will later allow resolving some transport protocols that are
attached to existing FDs. For raw FDs (e.g. logs) and for socket pairs,
the FD number is still returned in the address, because we need the
underlying address management to complete the bind/listen/connect/whatever
needed. One immediate benefit is that passing a bad FD will now result in
one of these errors:
'bind' : cannot use file descriptor '3' : Socket operation on non-socket.
'bind' : socket on file descriptor '3' is of the wrong type.
Note that as of now, we never return a listening socket with a family of
AF_CUST_EXISTING_FD. The only case where this family is seen is for a raw
FD (e.g. logs).
If a file descriptor was passed, we can optionally return it. This will
be useful for listening sockets which are both a pre-bound FD and a ready
socket.
These flags indicate whether the call is made to fill a bind or a server
line, or even just send/recv calls (like logs or dns). Some special cases
are made for outgoing FDs (e.g. pipes for logs) or socket FDs (e.g. external
listeners), and there's a distinction between stream or dgram usage that's
expected to significantly help str2sa_range() proceed appropriately with
the input information. For now they are not used yet.
Now that str2sa_range() checks for an appropriate port specification, we
don't need to implement ad hoc test cases in every call place: if the
result is valid, the conditions are met; otherwise the error message is
appropriately filled.
Now str2sa_range() will enforce the caller's port specification passed
using the PA_O_PORT_* flags, and will return an error on failure. For
optional ports, values 0-65535 will be enforced. For mandatory ports,
values 1-65535 are enforced. In case of ranges, it is also verified that
the upper bound is not lower than the lower bound, as this used to result
in empty listeners.
I couldn't find an easy way to test this using VTC since the purpose is
to trigger parse errors, so instead a test file is provided as
tests/ports.cfg with comments about what errors are expected for each
line.
These flags indicate what is expected regarding port specifications. Some
callers accept none, some need fixed ports, some have them mandatory, some
support ranges, and some take an offset. Each possibility is reflected by
an option. For now they are not exploited, but the goal is to instrument
str2sa_range() to properly parse that.
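A sketch of the intended use from a caller's perspective (the exact
str2sa_range() prototype at this point of the series and the precise
PA_O_PORT_* flag names are assumed, not verified):

  /* a "server"-like caller: resolved address with a mandatory fixed port */
  struct sockaddr_storage *sk;

  sk = str2sa_range(args[2], &port, &low, &high, &errmsg, NULL, NULL,
                    PA_O_RESOLVE | PA_O_PORT_MAND);
  if (!sk) {
          ha_alert("parsing [%s:%d] : %s\n", file, linenum, errmsg);
          free(errmsg);
          return -1;
  }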
We currently have an argument to require that the address is resolved
but we'll soon add more, so let's turn it into a bit field. The old
"resolve" boolean is now PA_O_RESOLVE.
The code is built to match prefixes at one place and to parse the address
as a second step, except for fd@ and sockpair@ where the test first passes
via AF_UNSPEC that is changed again. This is ugly and confusing, so let's
proceed like for the other ones.
At some places (log fd@XXX, bind fd@XXX) we support using an explicit
file descriptor number, that is placed into the sockaddr for later use.
The problem is that till now it was done with an AF_UNSPEC family, which
is also used for other situations like missing info or rings (for logs).
Let's create an "official" family AF_CUST_EXISTING_FD for this case so
that we are certain the FD can be found in the address when it is set.
This removes the following fields from struct protocol that are now
retrieved from the protocol family instead: .sock_family, .sock_addrlen,
.l3_addrlen, .addrcmp, .bind, .get_src, .get_dst.
This also removes the UDP-specific udp{,6}_get_{src,dst}() functions
which were referenced but not used yet. Their goal was only to remap
the original AF_INET* addresses to AF_CUST_UDP*.
Note that .sock_domain is still there as it's used as a selector for
the protocol struct to be used.
We now take care of retrieving sock_family, l3_addrlen, bind(),
addrcmp(), get_src() and get_dst() from the protocol family and
not just the protocol itself. There are very few places, this was
only seldom used. Interestingly, sock_inet.c used to rely on
->sock_family instead of ->sock_domain, and sock_unix.c used to
hard-code PF_UNIX instead of using ->sock_domain.
Also it appears obvious that something is wrong in the protocol
selection algorithm, because sock_domain is the one set to the custom
protocols while it ought to be sock_family instead, which would avoid
having to hard-code some conversions, namely for UDP.
We need to specially handle protocol families which regroup common
functions used for a given address family. These functions include
bind(), addrcmp(), get_src() and get_dst() for now. Some fields are
also added about the address family, socket domain (protocol family
passed to the socket() syscall), and address length.
These protocol families are referenced from the protocols but not yet
used.
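Such a family descriptor can be sketched as follows (the field names are
assumed from the description above, not copied from the tree):

  /* sketch: per-address-family operations and metadata */
  struct proto_fam {
          char name[8];             /* e.g. "inet4", "unix" */
          sa_family_t sock_family;  /* address family (AF_INET, ...) */
          int sock_domain;          /* domain passed to socket() */
          socklen_t sock_addrlen;   /* size of the sockaddr */
          int l3_addrlen;           /* layer-3 address length */
          int (*addrcmp)(const struct sockaddr_storage *a,
                         const struct sockaddr_storage *b);
          int (*bind)(struct receiver *rx, char **errmsg);
          int (*get_src)(int fd, struct sockaddr *sa, socklen_t salen, int dir);
          int (*get_dst)(int fd, struct sockaddr *sa, socklen_t salen, int dir);
  };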
All protocol's listeners now only take care of themselves and not of
the receiver anymore since that's already being done in proto_bind_all().
Now it finally becomes obvious that UDP doesn't need a listener, as the
only thing it does is to set the listener's state to LI_LISTEN!
Now protocol_bind_all() starts the receivers before their respective
listeners so that ultimately we won't need the listeners for non-
connected protocols.
We still have to resort to an ugly trick to set the I/O handler in
case of syslog over UDP because for now it's still not set in the
receiver, so we hard-code it.
Note that for now we don't have a sockpair.c file to host that unusual
family, so the new function was placed directly into proto_sockpair.c.
It's no big deal given that this family is currently not shared with
multiple protocols.
The function does almost nothing but setting up the receiver. This is
normal as the socket the FDs are passed onto are supposed to have been
already created somewhere else, and the only usable identifier for such
a socket pair is the receiving FD itself.
The function was assigned to sockpair's ->bind() and is not used yet.
This removes all the AF_UNIX-specific code from uxst_bind_listener()
and now simply relies on sock_unix_bind_listener() to do the same
job. As mentioned in the previous commit, the only difference is that
now an unlikely failure on listen() will not result in a roll back
of the temporary socket names since they will have been renamed
during the bind() operation (as expected). But such failures do not
correspond to any normal case and mostly denote operating system
issues so there's no functionality loss here.
This function performs all the bind-related stuff for UNIX sockets that
was previously done in uxst_bind_listener(). There is a very tiny
difference however, which is that previously, in the unlikely event
where listen() would fail, it was still possible to roll back the binding
and rename the backup to the original socket. Now we have to rename it
before returning, hence it will be done before calling listen().
However, this doesn't cover any particular use case since listen() has no
reason to fail there (and the rollback is not done for inherited sockets),
that was just done that way as a generic error processing path.
The code is not used yet and is referenced in the uxst proto's ->bind().
This removes all the AF_INET-specific code from udp_bind_listener()
and now simply relies on sock_inet_bind_listener() to do the same
job. The function is now basically just a wrapper around
sock_inet_bind_receiver().
This removes all the AF_INET-specific code from tcp_bind_listener()
and now simply relies on sock_inet_bind_listener() to do the same
job. The function was now roughly cut in half and its error path
significantly simplified.
This function collects all the receiver-specific code from both
tcp_bind_listener() and udp_bind_listener() in order to provide a more
generic AF_INET/AF_INET6 socket binding function. For now the API is
not very elegant because some info is still missing from the receiver
while there's no ideal place to fill them except when calling ->listen()
at the protocol level. It looks like some polishing code is needed in
check_config_validity() or somewhere around this in order to finalize
the receivers' setup. The main issue is that listeners and receivers
are created *before* bind_conf options are parsed and that there's no
finishing step to resolve some of them.
The function currently sets up a receiver and subscribes it to the
poller. In an ideal world we wouldn't subscribe it but let the caller
do it after having finished to configure the L4 stuff. The problem is
that the caller would then need to perform an fd_insert() call and to
possibly set the exported flag on the FD while it's not its job. Maybe
an improvement could be to have a separate sock_start_receiver() call
in sock.c.
For now the function is not used but it will soon be. It's already
referenced as tcp and udp's ->bind().
The new RX_O_FOREIGN, RX_O_V6ONLY and RX_O_V4V6 options are now set into
the rx_settings part during the parsing, so that we don't need to adjust
them in each and every listener anymore. We have to keep both v4v6 and
v6only due to the precedence from v6only over v4v6.
It's the receiver's FD that's inherited from the parent process, not
the listener's, so the flag must move to the receiver so that appropriate
actions can be taken.
In order to split the receiver from the listener, we'll need to know that
a socket is already bound and ready to receive. We used to do that via
the LI_O_ASSIGNED state but that's not sufficient anymore since the
receiver might not belong to a listener anymore. The new RX_F_BOUND flag
is used for this.
Some socket settings used to be retrieved via the listener and the
bind_conf. Now instead we use the receiver and its settings whenever
appropriate. This will simplify the removal of the dependency on the
listener.
A receiver will have to pass a context to be installed into the fdtab
for use by the handler. We need to set this into the receiver struct
as the bind will happen long after the configuration is parsed.
Just like listeners keep a pointer to their bind_conf, receivers now also
have a pointer to their rx_settings. All those belonging to a listener are
automatically initialized with a pointer to the bind_conf's settings.
sock_find_compatible_fd() can now access the protocol via the receiver
hence it can access its socket type and know whether the receiver has
dgram or stream sockets, so we don't need to hack around AF_CUST_UDP*
anymore there.
The receiver is the one which depends on the protocol while the listener
relies on the receiver. Let's move the protocol there. Since there's also
a list element to get back to the listener from the proto list, this list
element (proto_list) was moved as well. For now when scanning protos, we
still see listeners which are linked by their rx.proto_list part.
The listening socket is represented by its file descriptor, which is
generic to all receivers and not just listeners, so it must move to
the rx struct.
It's worth noting that in order to extend receivers and listeners to
other protocols such as QUIC, we'll need other handles than file
descriptors here, and that either a union or a cast to uintptr_t
will have to be used. This was not done yet and the field was
preserved under the name "fd" to avoid adding confusion.
The netns is common to all listeners/receivers and is used to bind the
listening socket, so it must be in the receiver settings and not in the
listener. This removes yet another set of unnecessary loops.
The interface is common to all listeners/receivers and is used to bind
the listening socket so it must be in the receiver settings and not in
the listener. This removes some unnecessary loops.
There currently is a large inconsistency in how binding parameters are
split between bind_conf and listeners. It happens that for historical
reasons some parameters are available at the listener level but cannot
be configured per-listener but only for a bind_conf, and thus, need to
be replicated. In addition, some of the bind_conf parameters are in fact
for the listening socket itself while others are for the instantiated
sockets.
A previous attempt at splitting listeners into receivers failed because
the boundary between all these settings is not well defined.
This patch introduces a level of listening socket settings in the
bind_conf, that will be detachable later. Such settings that are solely
for the listening socket are:
- unix socket permissions (used only during binding)
- interface (used for binding)
- network namespace (used for binding)
- process mask and thread mask (used during startup)
The rest seems to be used only to initialize the resulting sockets, or
to control the accept rate. For now, only the unix params (bind_conf->ux)
were moved there.
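The resulting split can be sketched like this (only ->ux is populated by
this patch; interface and netns follow in the later patches above, and the
field names are assumed):

  /* sketch: settings of the listening socket, detachable from bind_conf */
  struct rx_settings {
          struct {                          /* UNIX socket permissions */
                  uid_t uid;
                  gid_t gid;
                  mode_t mode;
          } ux;
          char *interface;                  /* interface used for binding */
          const struct netns_entry *netns;  /* namespace used for binding */
  };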
Just like with previous commit, DNS nameservers are affected as well with
addresses starting in "udp@", but here it's different, because due to
another bug in the DNS parser, the address is rejected, indicating that
it doesn't have a ->connect() method. Similarly, the DNS code believes
it's working on top of TCP at this point and this used to work because of
this. The same fix is applied to remap the protocol and the ->connect test
was dropped.
No backport is needed, as the ->connect() test will never strike in 2.2
or below.
Commit 3835c0dcb ("MEDIUM: udp: adds minimal proto udp support for
message listeners.") introduced a problematic side effect in log server
address parser: if "udp@", "udp4@" or "udp6@" prefixes a log server's
address, the address is passed as-is to the log server with a non-existing
family and fails like this when trying to send:
[ALERT] 259/195708 (3474) : socket() failed in logger #1: Address family not supported by protocol (errno=97)
The problem is that till now there was no UDP family, so logs expect an
AF_INET family to be passed for UDP there.
This patch manually remaps AF_CUST_UDP4 and AF_CUST_UDP6 to their "tcp"
equivalents that the log server parser expects. No backport is needed.
Remove the last utility functions for handling the multi-cert bundles
and remove the multi-variable from the ckch structure.
With this patch, the bundles are completely removed.
The multi variable is not useful anymore since the removal of the
multi-certificates bundle support. It can be removed safely from the CLI
functions, and we can assume that every ckch contains a single certificate.
Since the removal of the multi-certificates bundle support, this
variable is not useful anymore. We can remove all tests for this
variable and assume that every ckch contains a single certificate.
Like the previous commit, this one emulates the bundling by loading each
certificate separately and storing it in a separate SSL_CTX.
This patch does it for the standard certificate loading, which means
outside directories or crt-list.
The multi-certificates bundle was the common way of offering multiple
certificates of different types (ecdsa and rsa) for a same SSL_CTX.
This was implemented with OpenSSL 1.0.2 before the client_hello callback
was available.
Now that all versions which do not support this callback are
deprecated (< 1.1.0), we can safely remove the support for the bundle,
which was inconvenient and complexified the code too much.
The multi-certificates bundle was the common way of offering multiple
certificates of different types (ecdsa and rsa) for a same SSL_CTX.
This was implemented with OpenSSL 1.0.2 before the client_hello callback
was available.
Now that all versions which do not support this callback are
deprecated (< 1.1.0), we can safely remove the support for the bundle,
which was inconvenient and complexified the code too much.
This patch emulates the bundle loading by looking for the bundle files
when the specified file in the configuration does not exist. It then
creates new entries in the crtlist, so they will appear as new lines if
they are dumped from the CLI.
Remove the support for multi-certificates bundle in the CLI. There is
nothing to replace here, it will use the standard codepath with the
"bundle emulation" in the future.
The multi-cert certificates bundle is the former way, implemented with
OpenSSL 1.0.2, of doing multi-certificate (RSA, ECDSA and DSA) for the
same SNI host. Remove this support temporarily so it can be replaced by
the loading of each certificate in a separate SSL_CTX.
The use of "bind" wasn't that wise but was temporary. The problem is that
it will not allow to coexist with tcp. Let's explicitly call it "dgram-bind"
so that datagram listeners are expected here, leaving some room for stream
listeners later. This is the only change.
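A minimal illustration of the resulting syntax:

  log-forward syslog
      dgram-bind 0.0.0.0:1514
      log 127.0.0.1:5514 local0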
Since the refactoring of the crt-list, the same function is used to
parse a crt-list file and a crt-list line on the CLI.
The assumption was made that a line on the CLI and a line in a file both
end with a \n. However that is potentially not the case with a file which
does not end with a \n.
This patch fixes issue #860 and must be backported in 2.2.
In the SSL code, when we were waiting for the availability of the crypto
engine, once it is ready and its fd's I/O handler is called, don't call
ssl_sock_io_cb() directly, instead, call tasklet_wakeup() on the
ssl_sock_ctx's tasklet. We were calling ssl_sock_io_cb() with NULL as
a tasklet, which used to be fine, but it is no longer true since the
fd takeover changes. We could just provide the tasklet, but let's just
wake the tasklet, as is done for other FDs, for fairness.
This should fix github issue #856.
This should be backported into 2.2.
The socks4 keyword parser was a bit too much copy-pasted, it only checks
for a null port and reports "invalid range". Let's properly check for the
1-65535 range and report the correct error.
It may be backported everywhere "socks4" is present (2.0).
In bug #835, @arjenzorgdoc reported that the verifyhost option on the
server line is case-sensitive, that shouldn't be the case.
This patch fixes the issue by replacing memcmp by strncasecmp and strcmp
by strcasecmp. The patch was suggested by @arjenzorgdoc.
This must be backported in all versions supporting the verifyhost
option.
Changes performed using the following coccinelle patch:
@@
type T;
expression E;
expression t;
@@
(
t = calloc(E, sizeof(*t))
|
- t = calloc(E, sizeof(T))
+ t = calloc(E, sizeof(*t))
)
Looking through the commit history, grepping for coccinelle shows that the same
replacement with a different patch was already performed in the past in commit
02779b6263.
newsrv->curr_idle_thr is of type `unsigned int`, not `int`. Fix this issue
by simply passing the dereferenced pointer to sizeof, which is the preferred
style anyway.
This bug was introduced in commit dc2f2753e9.
It first appeared in 2.2-dev5. The patch must be backported to 2.2+.
It is notable that the `calloc` call was not introduced within the commit in
question. The allocation was already happening before that commit and it
already looked like it does after applying the patch. Apparently the
argument for the `sizeof` managed to get broken during the rearrangement
that happened in that commit:
for (i = 0; i < global.nbthread; i++)
- MT_LIST_INIT(&newsrv->idle_orphan_conns[i]);
- newsrv->curr_idle_thr = calloc(global.nbthread, sizeof(*newsrv->curr_idle_thr));
+ MT_LIST_INIT(&newsrv->safe_conns[i]);
+
+ newsrv->curr_idle_thr = calloc(global.nbthread, sizeof(int));
Even more notable is that I previously fixed that *exact same* allocation in
commit 017484c80f.
So apparently this single line managed to get broken twice in the same
way, for whatever reason there might be.
iif() takes a boolean as input and returns one of the two argument
strings depending on whether the boolean is true.
This converter is most likely useful to return the proper scheme
depending on the value returned by the `ssl_fc` fetch, e.g. for use within
the `x-forwarded-proto` request header.
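For example:

  http-request set-header x-forwarded-proto %[ssl_fc,iif(https,http)]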
However it can also be useful for use within a template that is sent to
the client using `http-request return` with a `lf-file`. It allows the
administrator to implement a simple condition, without needing to prefill
variables within the regular configuration using `http-request
set-var(req.foo)`.
It must be done to expire patterns cached in the LRU cache. Otherwise it is
possible to retrieve an already freed pattern, attached to a released pattern
expression.
When a specific pattern is deleted (->delete() callback), the pattern expression
revision is already renewed. Thus it is not affected by this bug. Only the
prune action on the pattern expression is concerned.
In addition, for a pattern expression, in ->prune() callbacks when the pattern
list is released, a missing LIST_DEL() has been added. It is not a real issue
because the list is reinitialized at the end and all elements are released and
should never be reused. But it is less confusing this way.
This bug may be triggered when a map is cleared from the cli socket. A
workaround is to set the pattern cache size (tune.pattern.cache-size) to 0 to
disable it.
This patch should fix the issue #844. It must be backported to all supported
versions.
This allocation technically is always reachable and cannot leak, however other
global variables such as `oldpids` are already being freed. This is in an
attempt to get HAProxy to a state where there are zero live allocations after a
clean exit.
Given the following example configuration:
listen http
bind *:80
mode http
stats scope .
Running a configuration check with valgrind reports:
==16341== 26 (24 direct, 2 indirect) bytes in 1 blocks are definitely lost in loss record 3 of 13
==16341== at 0x4C2FB55: calloc (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==16341== by 0x571C2E: stats_add_scope (uri_auth.c:296)
==16341== by 0x46CE29: cfg_parse_listen (cfgparse-listen.c:1901)
==16341== by 0x45A112: readcfgfile (cfgparse.c:2078)
==16341== by 0x50A0F5: init (haproxy.c:1828)
==16341== by 0x418248: main (haproxy.c:3012)
After this patch is applied the leak is gone as expected.
This is a very minor leak that can only be observed if deinit() is called,
shortly before the OS will free all memory of the process anyway. No
backport needed.
It initially looked appealing to be able to call traces with ",,," for
unused arguments, but tcc doesn't like empty macro arguments, and quite
frankly, adding a zero between the few remaining ones is no big deal.
Let's do so now.
Commit 77b98220e ("BUG/MINOR: threads: work around a libgcc_s issue with
chrooting") tried to address an issue with libgcc_s being loaded too late.
But it turns out that the symbol used there isn't present on armhf, thus
it breaks the build.
Given that the issue manifests itself during pthread_exit(), the safest
and most portable way to test this is to call pthread_exit(). For this
we create a dummy thread which exits, during the early boot. This results
in the relevant library to be loaded if needed, making sure that a later
call to pthread_exit() will still work. It was tested to work fine under
linux on the following platforms:
glibc:
- armhf
- aarch64
- x86_64
- sparc64
- ppc64le
musl:
- mipsel
Just running the code under strace easily shows the call in the dummy
thread, for example here on armhf:
$ strace -fe trace=file ./haproxy -v 2>&1 | grep gcc_s
[pid 23055] open("/lib/libgcc_s.so.1", O_RDONLY|O_CLOEXEC) = 3
The code was isolated so that it's easy to #ifdef it out if needed.
This should be backported where the patch above is backported (likely
2.0).
The condition in h1_refresh_timeout() seems insufficient to properly
take care of the half-closed timeout, because depending on the ordering
of operations when performing the last send() to a client, the stream
may or may not still be there and we may fail to shrink the client
timeout on our last opportunity to do so.
Here we want to make sure that the timeout is always reduced when the
last chunk was sent and the shutdown completed, regardless of the
presence of a stream or not. This is what this patch does.
This should be backported as far as 2.0, and should fix the issue
reported in #541.
Since 1.8 with commit e8692b41e ("CLEANUP: auth: use the build options list
to report its support"), crypt(3) is always reported as being supported in
"haproxy -vv" because no test on USE_LIBCRYPT is made anymore when
producing the output.
This reintroduces the distinction between with and without USE_LIBCRYPT
in the output by indicating "yes" or "no". It may be backported as far
as 1.8, though the code differs due to a number of include files cleanups.
When the server address is set for the first time, the log message is a bit ugly
because there is no old ip address to report. Thus in the log, we can see :
PX/SRV changed its IP from to A.B.C.D by DNS additional record.
Now, when this happens, "(none)" is reported :
PX/SRV changed its IP from (none) to A.B.C.D by DNS additional record.
This patch may be backported to 2.2.
When a SRV record for an already known server is processed, only the weight
is updated, if not configured to be ignored. It is a problem if the IP address
carried by the associated additional record changes, because the server IP
address is never renewed.
To fix this bug, if there is an additional record attached to a SRV record,
we always try to set the IP address. If it is the same, no change is
performed. This way, IP changes are always handled.
This patch should fix the issue #841. It must be backported to 2.2.
A SRV record keeps a reference on the corresponding additional record, if
any. But this additional record is also inserted in a separate linked-list into
the dns response. The problems arise when obsolete additional records are
released. The additional records list is purged but the SRV records always
reference these objects, leading to an undefined behavior. Worst, this happens
very quickly because additional records are never renewed. Thus, once received,
an additional record will always expire.
Now, the addtional record are only associated to a SRV record or simply
ignored. And the last version is always used.
This patch helps to fix the issue #841. It must be backported to 2.2.
The pathq sample fetch extracts the relative URI of a request, i.e. the path
with the query-string, excluding the scheme and the authority, if any. It is
pretty handy to always get a relative URI independently of the HTTP version.
Indeed, while relative URIs are common in HTTP/1.1, in HTTP/2 clients use
absolute URIs most of the time.
This patch may be backported to 2.2.
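For example, the following exposes the same relative URI to the server
whether the request arrived over H1 or H2 (the header name is arbitrary):

  http-request set-header x-req-uri %[pathq]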
These actions do the same as the corresponding "-path" versions except that
the query-string is included in the manipulated request path. This means the
set-pathq action replaces the path and the query-string, and the replace-pathq
action matches and replaces the path including the query-string.
This patch may be backported to 2.2.
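Illustrative examples (the paths and regex are made up):

  http-request set-pathq /maintenance?from=%[path]
  http-request replace-pathq ^/api/(.*)$ /internal/\1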
This reverts commit 4b9c0d1fc0.
Actually, the "replace-path" action is ambiguous. "set-path" action preserves
the query-string. The "path" sample fetch does not contain the query-string. But
"replace-path" action is documented to handle the query-string. It is probably
not the expected behavior. So instead of fixing the code, we will fix the
documentation to make "replace-path" action consistent with other parts of the
code. In addition actions and sample fetches to handle the path with the
query-string will be added.
If the commit above is ever backported, this one must be as well.
It was reported in bug #837 that haproxy -s causes a 100% CPU.
However this option does not exist and haproxy must exit with the
usage message.
The parser was not handling the case where -s is not followed by 't' or
'f' which are the only two valid cases.
This bug was introduced by df6c5a ("BUG/MEDIUM: mworker: fix the copy of
options in copy_argv()") which was backported as far as 1.8.
This fix must be backported as far as 1.8.
Ever since the protocols were added in 1.3.13, listeners used to be
started twice:
- once by start_proxies(), which iterates over all proxies then all
listeners ;
- once by protocol_bind_all() which iterates over all protocols then
all listeners ;
It's a real mess because error reporting is not even consistent, and
more importantly now that some protocols do not appear in regular
proxies (peers, logs), there is no way to retry their binding should
it fail on the last step.
What this patch does is to make sure that listeners are exclusively
started by protocols. The failure to start a listener now causes the
emission of an error indicating the proxy's name (as it used to be
the case per proxy), and retryable failures are silently ignored
during all but last attempts.
The start_proxies() function was kept solely for setting the proxy's
state to READY and emitting the "Proxy started" message and log that
some have likely become used to looking for in their logs.
Similarly to previous commit about ->bind_all(), we have the same
construct for ->unbind_all() which ought not to be used either. Let's
make protocol_unbind_all() iterate over all listeners and directly
call unbind_listener() instead.
It's worth noting that for uxst there was originally a specific
->unbind_all() function but the simplifications that came over the
years have resulted in a locally reimplemented version of the same
function: the test on (state > LI_ASSIGNED) there is equivalent to
the one on (state >= LI_PAUSED) that is used in do_unbind_listener(),
and it seems these have been equivalent since at least commit dabf2e264
("[MAJOR] added a new state to listeners")) (1.3.14).
All protocols only iterate over their own listeners list and start
the listeners using a direct call to their ->bind() function. This
code duplication doesn't make sense and prevents us from centralizing
the startup error handling. Worse, it's not even symmetric because
there's an unbind_all_listeners() function common to all protocols
without any equivalent for binding. Let's start by directly calling
each protocol's bind() function from protocol_bind_all().
Previous commit 77b98220e ("BUG/MINOR: threads: work around a libgcc_s
issue with chrooting") broke the build on cygwin. I didn't even know we
supported threads on cygwin. But the point is that it's actually the
glibc-based libpthread which requires libgcc_s, so in absence of other
reports we should not apply the workaround on other libraries.
This should be backported along with the aforementioned patch.
Sander Hoentjen reported another issue related to libgcc_s in issue #671.
What happens is that when the old process quits, pthread_exit() calls
something from libgcc_s.so after the process was chrooted, and this is
the first call to that library, causing an attempt to load it. In a
chroot, this fails, thus libpthread aborts. The behavior widely differs
between operating systems because some decided to use a static build for
this library.
In 2.2 this was resolved as a side effect of a workaround for the same issue
with the backtrace() call, which is also in libgcc_s. This was in commit
0214b45 ("MINOR: debug: call backtrace() once upon startup"). But backtraces
are not necessarily enabled, and we need something for older versions.
By inspecting a significant number of libgcc_s builds on various gcc versions and
platforms, it appears that a few functions have been present since gcc 3.0,
one of which, _Unwind_Find_FDE() has no side effect (it only returns a
pointer). What this patch does is that in the thread initialization code,
if built with gcc >= 3.0, a call to this function is made in order to make
sure that libgcc_s is loaded at start up time and that there will be no
need to load it upon exit.
An easy way to check which libs are loaded under Linux is :
$ strace -e trace=openat ./haproxy -v
With this patch applied, libgcc_s now appears during init.
Sander confirmed that this patch was enough to put an end to the core
dumps on exit in 2.0, so this patch should be backported there, and maybe
as far as 1.8.