The aim is to rassemble all scheduler information related to the current
thread. It simply points to task_per_thread[tid] without having to perform
the operation at each time. We save around 1.2 kB of code on performance
sensitive paths and increase the request rate by almost 1%.
There are a number of tests there which are enforced on tasklets while
they will never apply (various handlers, destroyed task or not, arguments,
results, ...). Instead let's have a single TASK_IS_TASKLET() test and call
the tasklet processing function directly, skipping all the rest.
It now appears visible that the only unneeded code is the update to
curr_task that is never used for tasklets, except for opportunistic
reporting in the debug handler, which can only catch si_cs_io_cb,
which in practice doesn't appear in any report so the extra cost
incurred there is pointless.
This change alone removes 700 bytes of code, mostly in
process_runnable_tasks() and increases the performance by about
1%.
Change the tasklet code so that the tasklet list is now a mt_list.
That means that tasklet now do have an associated tid, for the thread it
is expected to run on, and any thread can now call tasklet_wakeup() for
that tasklet.
One can change the associated tid with tasklet_set_tid().
In srv_add_to_idle_list(), use LIST_DEL_INIT instead of just LIST_DEL.
We're about to add the connection to a mt_list, and MT_LIST_ADD/MT_LIST_ADDQ
will be modified to make sure we're not adding the element if it's already
in a list.
Instead of using the same type for regular linked lists and "autolocked"
linked lists, use a separate type, "struct mt_list", for the autolocked one,
and introduce a set of macros, similar to the LIST_* macros, with the
MT_ prefix.
When we use the same entry for both regular list and autolocked list, as
is done for the "list" field in struct connection, we know have to explicitely
cast it to struct mt_list when using MT_ macros.
The FCGI application handles all the configuration parameters used to format
requests sent to an application. The configuration of an application is grouped
in a dedicated section (fcgi-app <name>) and referenced in a backend to be used
(use-fcgi-app <name>). To be valid, a FCGI application must at least define a
document root. But it is also possible to set the default index, a regex to
split the script name and the path-info from the request URI, parameters to set
or unset... In addition, this patch also adds a FCGI filter, responsible for
all processing on a stream.
To avoid code duplication in the futur mux FCGI, functions parsing H1 messages
and converting them into HTX have been moved in the file h1_htx.c. Some
specific parts remain in the mux H1. But most of the parsing is now generic.
Application is a generic term here. It is a modules which handle its own log
server list, with no dependency on a proxy. Such applications can now call the
function app_log() to log messages, passing a log server list and a tag as
parameters. Internally, the function __send_log() has been adapted accordingly.
When using "http-reuse safe", which is the default, a new incoming connection
does not automatically reuse an existing connection for the first request, as
we don't want to risk to lose the contents if we know the client will not be
able to replay the request. A side effect to this is that when dealing with
mostly http-close traffic, the reuse rate is extremely low and we keep
accumulating server-side connections that may even never be reused. At some
point we're limited to a ratio of file descriptors, but when the system is
configured with very high FD limits, we can still reach the limit of outgoing
source ports and make the system significantly slow down trying to find an
available port for outgoing connections. A simple test on my laptop with
ulimit 100000 and with the following config results in the load immediately
dropping after a few seconds :
listen l1
bind :4445
mode http
server s1 127.0.0.1:8000
As can be seen, the load falls from 38k cps to 400 cps during the first 200ms
(in fact when the source port table is full and connect() takes ages to find
a spare port for a new connection):
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
2439 2439 39338 39338 356094 5743 5743 0 0 0.4 0.5 0.4
7637 5198 38185 37666 1115002 5575 5499 0 0 0.7 0.5 0.7
7719 82 25730 820 1127002 3756 120 0 0 21.8 18.8 21.8
7797 78 19492 780 1138446 2846 114 0 0 61.4 2.5 61.4
7877 80 15754 800 1150182 2300 117 0 0 58.6 0.5 58.6
7920 43 13200 430 1156488 1927 63 0 0 58.9 0.3 58.9
At this point, lots of connections are indeed in use, for only 10 connections
on the frontend side:
$ ss -ant state established | wc -l
39022
This patch makes sure we never keep more idle connections than we've ever
had outstanding requests on a server. This way the total number of idle
connections will never exceed the sum of maximum connections. Thus highly
loaded servers will be able to get many connections and slightly loaded
servers will keep less. Ideally we should apply similar limits per process
and the per backend, but in practice this already addresses the issues
pretty well:
$ injectl464 -p 4 -o 1 -u 10 -G 127.0.0.1:4445/ -F -c -w 100
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
4423 4423 40209 40209 645758 5870 5870 0 0 0.2 0.4 0.2
8020 3597 40100 39966 1170920 5854 5835 0 0 0.2 0.4 0.2
12037 4017 40123 40170 1757402 5858 5864 0 0 0.2 0.4 0.2
16069 4032 40172 40320 2346074 5865 5886 0 0 0.2 0.4 0.2
20047 3978 40013 39386 2926862 5842 5750 0 0 0.3 0.4 0.3
24005 3958 40008 39979 3504730 5841 5837 0 0 0.2 0.4 0.2
$ ss -ant state established | wc -l
234
This patch must be backported to 2.0. It could be useful in 1.9 as well
eventhough pools and reuse are not enabled by default there.
As mentioned in previous commit, these flags do not map well to
modern poller capabilities. Let's use the FD_EV_*_{R,W} flags instead.
This first patch only performs a 1-to-1 mapping making sure that the
previously reported flags are still reported identically while using
the closest possible semantics in the pollers.
It's worth noting that kqueue will now support improvements such as
returning distinctions between shut and errors on each direction,
though this is not exploited for now.
These two functions are used to enable recv/send but only if the FD is
not marked as active yet. The purpose is to conditionally mark them as
tentatively usable without interfering with the polling if polling was
already enabled, when it's supposed to be likely true.
Given that all our I/Os are now directed from top to bottom and not the
opposite way around, and the FD cache was removed, it doesn't make sense
anymore to create FDs that are marked not ready since this would prevent
the first accesses unless the caller explicitly does an fd_may_recv()
which is not expected to be its job (which conn_ctrl_init() has to do
by the way). Let's move this into fd_insert() instead, and have a single
atomic operation for both directions via fd_may_both().
Now that we don't have to update FD_EV_POLLED_* at the same time as
FD_EV_ACTIVE_*, we don't need to use a CAS anymore, a bit-test-and-set
operation is enough. Doing so reduces the code size by a bit more than
1 kB. One function was special, fd_done_recv(), whose comments and doc
were inaccurate for the part related to the lack of polling.
Since commit 7ac0e35f2 in 1.9-dev1 ("MAJOR: fd: compute the new fd polling
state out of the fd lock") we've started to update the FD POLLED bit a
bit more aggressively. Lately with the removal of the FD cache, this bit
is always equal to the ACTIVE bit. There's no point continuing to watch
it and update it anymore, all it does is create confusion and complicate
the code. One interesting side effect is that it now becomes visible that
all fd_*_{send,recv}() operations systematically call updt_fd_polling(),
except fd_cant_recv()/fd_cant_send() which never saw it change.
The detail level initially based on syslog levels is not used, while
something related is missing, trace verbosity, to indicate whether or
not we want to call the decoding callback and what level of decoding
we want (raw captures etc). Let's change the field to "verbosity" for
this. A verbosity of zero means that the decoding callback is not
called, and all other levels are handled by this callback and are
source-specific. The source is now prompted to list the levels that
are proposed to the user. When the source doesn't define anything,
"quiet" and "default" are available.
Working on adding traces to mux-h2 revealed that the function names are
manually copied a lot in developer traces. The reason is that they are
not preprocessor macros and as such cannot be concatenated. Let's
slightly adjust the trace() function call to take a function name just
after the file:line argument. This argument is only added for the
TRACE_DEVEL and 3 new TRACE_ENTER, TRACE_LEAVE, and TRACE_POINT macros
and left NULL for others. This way the function name is only reported
for traces aimed at the developers. The pretty-print callback was also
extended to benefit from this. This will also significantly shrink the
data segment as the "entering" and "leaving" strings will now be merged.
One technical point worth mentioning is that the function name is *not*
passed as an ist to the inline function because it's not considered as
a builtin constant by the compiler, and would lead to strlen() being
run on it from all call places before calling the inline function. Thus
instead we pass the const char * (that the compiler knows where to find)
and it's the __trace() function that converts it to an ist for internal
consumption and for the pretty-print callback. Doing this avoids losing
5-10% peak performance.
The "payload" trace level was ambigous because its initial purpose was
to be able to dump received data. But it doesn't make sense to force to
report data transfers just to be able to report state changes. For
example, all snd_buf()/rcv_buf() operations coming from the application
layer should be tagged at this level. So here we move this payload level
above the state transitions and rename it to avoid the ambiguity making
one think it's only about request/response payload. Now it clearly is
about any data transfer and is thus just below the developer level. The
help messages on the CLI and the doc were slightly reworded to help
remove this ambiguity.
Save the authority TLV in a PROXYv2 header from the client connection,
if present, and make it available as fc_pp_authority.
The fetch can be used, for example, to set the SNI for a backend TLS
connection.
Previously the callback was almost mandatory so it made sense to have it
before the message. Now that it can default to the one declared in the
trace source, most TRACE() calls contain series of empty args and callbacks,
which make them suitable for being at the end and being totally omitted.
This patch thus reverses the TRACE arguments so that the message appears
first, then the mask, then arg1..arg4, then the callback. In practice
we'll mostly see 1 arg, or 2 args and nothing else, and it will not be
needed anymore to pass long series of commas in the middle of the
arguments. However if a source is enforced, the empty commas will still
be needed for all omitted arguments.
The principle is that when emitting a message, if some dropped events
were logged, we first attempt to report this counter before going
further. This is done under an exclusive lock while all logs are
produced under a shared lock. This ensures that the dropped line is
accurately reported and doesn't accidently arrive after a later
event.
The three functions (attach, IO handler, and release) are meant to be
called by any CLI command which requires to dump the contents of a ring
buffer. We do not implement anything generic to dump any ring buffer on
the CLI since it's meant to be used by other functionalities above.
However these functions deal with locking and everything so it's trivial
to embed them in other code.
This function tries to write to the ring buffer, possibly removing enough
old messages to make room for the new one. It takes two arrays of fragments
on input to ease the insertion of prefixes by the caller. It atomically
writes the message, possibly truncating it if desired, and returns the
operation's status.
Our circular buffers are well suited for being used as ring buffers for
not-so-structured data. The machanism here consists in making room in a
buffer before inserting a new record which is prefixed by its size, and
looking up next record based on the previous one's offset and size. We
can have up to 255 consumers watching for data (dump in progress, tail)
which guarantee that entrees are not recycled while they're being dumped.
The complete representation is described in the header file. For now only
ring_new(), ring_resize() and ring_free() are created.
Currently both logs and event sinks may use a file descriptor to
atomically emit some output contents. The two may use the same FD though
nothing is done to make sure they use the same lock. Also there is quite
some redundancy between the two. Better make a specific function to send
a fragmented message to a file descriptor which will take care of the
locking via the fd's lock. The function is also able to truncate a
message and to enforce addition of a trailing LF when building the
output message.
The new TRACE_<level>() macros take a mask, 4 args, a callback and a
static message. From this they also inherit the TRACE_SOURCE macro from
the caller, which contains the pointer to the trace source (so that it's
not required to paste it everywhere), and an ist string is also made by
the concatenation of the file name and the line number. This uses string
concatenation by the preprocessor, and turns it into an ist by the compiler
so that there is no operation at all to perform to adjust the data length
as the compiler knows where to cut during the optimization phase. Last,
the message is also automatically turned into an ist so that it's trivial
to put it into an iovec without having to run strlen() on it.
All arguments and the callback may be empty and will then automatically
be replaced with a NULL pointer. This makes the TRACE calls slightly
lighter especially since arguments are not always used. Several other
options were considered to use variadic macros but there's no outstanding
rule that justifies to place an argument before another one, and it still
looks convenient to have the message be the last one to encourage copy-
pasting of the trace statements.
A generic TRACE() macro takes TRACE_LEVEL in from the source file as the
trace level instead of taking it from its name. This may slightly simplify
the production of traces that always run at the same level (internal core
parts may probably only be called at developer level).
The trace() call will support an optional decoding callback and 4
arguments that this function is supposed to know how to use to provide
extra information. The output remains unchanged when the function is
NULL. Otherwise, the message is pre-filled into the thread-local
trace_buf, and the function is called with all arguments so that it
completes the buffer in a readable form depending on the expected
level of detail.
This new "level" argument will allow the trace sources to label the
traces for different purposes, and filter out some of them if they
are not relevant to the current target. Right now we have 5 different
levels:
- USER : the least verbose one, only a few functional information
- PAYLOAD: like user but also displays some payload-related information
- PROTO: focuses on the protocol's framing
- STATE: also indicate state internal transitions or non-transitions
- DEVELOPER: adds extra info about branches taken in the code (break
points, return points)
We now pass an extra argument "where" to the trace() call, which
is supposed to be an ist made of the concatenation of the filename
and the line number. We only keep the last 10 chars from this string
since the end of file names is most often easy to recognize. This
gives developers useful information at very low cost.
For now it remains quite basic. It performs a few state checks, calls
the source's sink if defined, and performs the transitions between
RUNNING, STOPPED and WAITING when the configured events match.
For now it lists the sources if one is not provided, and checks
for the source's existence. It lists the events if not provided,
checks for their existence if provided, and adjusts reported
events/start/stop/pause events, and performs state transitions.
It lists sinks and adjusts them as well. Filters, lock, and
level are not implemented yet.
The principle of this subsystem will be to support taking live traces
at various places in the code with conditional triggers, filters, and
ability to lock on some elements. The traces will support typed events
and will be sent into sinks made of ring buffers, file descriptors or
remote servers.
This is the most basic type of sink. It pre-registers "stdout" and
"stderr", and is able to use writev() on them. The writev() operation
is locked to avoid mixing outputs. It's likely that the registration
should move somewhere else to take into account the fact that stdout
and stderr are still opened or are closed.
The principle will be to be able to dispatch events to various destinations
called "sinks". This is already done in part in logs where log servers can
be either a UDP socket or a file descriptor. This will be needed with the
new trace subsystem where we may also want to add ring buffers. And it turns
out that all such destinations make sense at all places. Logs may need to be
sent to a TCP server via a ring buffer, or consulted from the CLI. Trace
events may need to be sent to stdout/stderr as well as to remote log servers.
This patch creates a new structure "sink" aiming at addressing these similar
needs. The goal is to merge together what is common to all of them, such as
the output format, the dropped events count, etc, and also keep separately
the target identification (network address, file descriptor). Provisions
were made to have a "waiter" on the sink. For a TCP log server it will be
the task to wake up after writing to the log buffer. For a ring buffer, it
could be the list of watchers on the CLI running a "tail" operation and
waiting for new events. A lock was also placed in the struct since many
operations will require some locking, including the FD ones. The output
formats covers those in use by logs and two extra ones prepending the ISO
time in front of the message (convenient for stdio/buffer).
For now only the generic infrastructure is present, no type-specific
output is implemented. There's the sink_write() function which prepares
and formats a message to be sent, trying hard to avoid copies and only
using pointer manipulation, where the type-specific code just has to be
added. Dropped messages are already counted (for now 100% drop). The
message is put into an iovec array as it will be trivial to use with
file descriptors and sockets.
The current functions are seen outside from the debugging code and are
convenient to export so that we can improve the thread dump output :
void hlua_applet_tcp_fct(struct appctx *ctx);
void hlua_applet_http_fct(struct appctx *ctx);
struct task *hlua_process_task(struct task *task, void *context, unsigned short state);
Of course they are only available when USE_LUA is defined.
When I/O events are being processed, we want to make sure to mark the
thread as not stuck. The reason is that some pollers (like poll()) which
do not limit the number of FDs they report could possibly report a huge
amount of FD all having to perform moderately expensive operations in
the I/O callback (e.g. via mux-pt which forwards to the upper layers),
making the watchdog think the thread is stuck since it does not schedule.
Of course this must never happen but if it ever does we must be liberal
about it.
This should be backported to 2.0, where the situation may happen more
easily due to the FD cache which can start to collect a large amount of
events. It may be related to the report in issue #201 though nothing is
certain about it.
These functions perform all the boring filling of the appctx's
cli struct needed by CLI parsers to return a message or an error,
and they return 1 so that they can be used as a single-line return
statement. They may be used for const messages or dynamic messages.
When parsing references to stick-tables declared as backends, they are added to
a list of proxies (they are proxies!) which refer to this stick-tables.
Before this patch we added them to these list without checking they were already
present, making the silly hypothesis the actions/sample were checked/resolved in the same
order the proxies are parsed.
This patch implement a simple inline function to in_proxies_list() to test
the presence of a proxy in a list of proxies. We use this function when resolving
/checking samples/actions.
This bug was introduced by 015e4d7 commit.
Must be backported to 2.0.
In the poller code, instead of just remembering if we're currently polling
a fd or not, remember if we're polling it for writing and/or for reading, that
way, we can avoid to modify the polling if it's already polled as needed.
Now that the architecture was changed so that attempts to receive/send data
always come from the upper layers, instead of them only trying to do so when
the lower layer let them know they could try, we can finally get rid of the
fd cache. We don't really need it anymore, and removing it gives us a small
performance boost.
Dragan Dosen found that the listeners lock is not sufficient to protect
the listeners list when proxies are stopping because the listeners are
also unlinked from the protocol list, and under certain situations like
bombing with soft-stop signals or shutting down many frontends in parallel
from multiple CLI connections, it could be possible to provoke multiple
instances of delete_listener() to be called in parallel for different
listeners, thus corrupting the protocol lists.
Such operations are pretty rare, they are performed once per proxy upon
startup and once per proxy on shut down. Thus there is no point trying
to optimize anything and we can use a global lock to protect the protocol
lists during these manipulations.
This fix (or a variant) will have to be backported as far as 1.8.
Empty error files may be used to disable the sending of any message for specific
error codes. A common use-case is to use the file "/dev/null". This way the
default error message is overridden and no message is returned to the client. It
was supported in the legacy HTTP mode, but not in HTX. Because of a bug, such
messages triggered an error.
This patch must be backported to 2.0 and 1.9. However, the patch will have to be
adapted.
When forcing the outgoing address of a connection, till now we used to
allocate this outgoing connection and set the address into it, then set
SF_ADDR_SET. With connection reuse this causes a whole lot of issues and
difficulties in the code.
Thanks to the previous changes, it is now possible to store the target
address into the stream instead, and copy the address from the stream to
the connection when initializing the connection. assign_server_address()
does this and as a result SF_ADDR_SET now reflects the presence of the
target address in the stream, not in the connection. The http_proxy mode,
the peers and the master's CLI now use the same mechanism. For now the
existing connection code was not removed to limit the amount of tricky
changes, but the allocated connection is not used anymore.
This change also revealed a latent issue that we've been having around
option http_proxy : the address was set in the connection but neither the
SF_ADDR_SET nor the SF_ASSIGNED flags were set. It looks like the connection
could establish only due to the fact that it existed with a non-null
destination address.
Now addresses are dynamically allocated when needed. Each connection is
created with src=dst=NULL, these entries are allocated on the fly, and
released when the connection is released.
This commit places calls to sockaddr_alloc() at the places where an address
is needed, and makes sure that the allocation is properly tested. This does
not add too many error paths since connection allocations are already in the
vicinity and share the same error paths. For the two cases where a
clear_addr() was called, instead the address was not allocated.
This pool will be used to allocate storage for source and destination
addresses used in connections. Two functions sockaddr_{alloc,free}()
were added and will have to be used everywhere an address is needed.
These ones are safe for progressive replacement as they check that the
existing pointer is set before replacing it. The pool is not yet used
during allocation nor freeing. Also they operate on pointers to pointers
so they will perform checks and replace values. The free one nulls the
pointer.
This is in preparation for the switch to dynamic address allocation,
let's migrate the code using the old fields to the pointers instead.
Note that no extra check was added for now, the purpose is only to
get the code to use the pointers and still work.
In the proxy protocol message handling we make sure the addresses are
properly allocated before declaring them unset.
At the moment we're facing difficulties with connection reuse based on
the fact that connections may be allocated very early only to set a
target address in transparent mode. With the imminent removal of the
legacy mode, the connection reuse by a same stream will not exist
anymore and all this awful complexity is not justified anymore. However
we still need to be able to assign addresses somewhere.
Thus instead of allocating a connection, we'll only place addresses where
needed in the stream during operations. But this takes quite some room
(typically 128 bytes). This is a nice opportunity for cleaning all this
up and dynamically allocatating the addresses fields, which will result
in actually saving memory from connection structs since most of the time
the client's "to" address is not used and the server's "from" is not used
either, thus saving ~256 bytes per end-to-end connection.
For now these new "src" and "dst" pointers point to addr.from and addr.to.
This will allow us to smoothly update the whole code to use these pointers
prior to going further and switching them to pools.
These functions are not used anymore. They didn't report failures and
as such were often misused. conn_get_src() and conn_get_dst() now
replaced them everywhere.
The backend connect code uses conn_get_{from,to}_addr to forward addresses
in transparent mode and to map server ports, without really checking if the
operation succeeds. In preparation of future changes, let's switch to
conn_get_{src,dst}() and integrate status check for possible failures.
These functions currently are the same as conn_get_from_addr() and
conn_get_to_addr() respectively except that they return a status for
the operation that the caller can test.
Default HTTP error messages are stored in an array of chunks. And since the HTX
was added, these messages are also converted in HTX and stored in another
array. But now, the first array is not used anymore because the legacy HTTP mode
was removed.
So now, only the array with the HTX messages are kept. The other one was
removed.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
First of all, all legacy HTTP analyzers and all functions exclusively used by
them were removed. So the most of the functions in proto_http.{c,h} were
removed. Only functions to deal with the HTTP transaction have been kept. Then,
http_msg and hdr_idx modules were entirely removed. And finally the structure
http_msg was lightened of all its useless information about the legacy HTTP. The
structure hdr_ctx was also removed because unused now, just like unused states
in the enum h1_state. Note that the memory pool "hdr_idx" was removed and
"http_txn" is now smaller.
This commit breaks the compatibility with filters still relying on the legacy
HTTP code. The legacy callbacks were removed (http_data, http_chunk_trailers and
http_forward_data).
For now, the filters must still set the flag FLT_CFG_FL_HTX to be used on HTX
streams.
Since the legacy HTTP mode is disbabled, all HTTP sample fetches work on HTX
streams. So it is safe to remove all code relying on HTTP legacy mode. Among
other things, the function smp_prefetch_http() was removed with the associated
macros CHECK_HTTP_MESSAGE_FIRST() and CHECK_HTTP_MESSAGE_FIRST_PERM().
Since the legacy HTTP mode is disabled and no multiplexer relies on it anymore,
there is no reason to have 2 multiplexer protocols for the HTTP. So the protocol
PROTO_MODE_HTX was removed and all HTTP multiplexers use now PROTO_MODE_HTTP.
A long time ago, applets were seen as an alternative to connections,
and since their respective sizes were roughly equal it appeared wise
to share the same pool. Nowadays, connections got significantly larger
but applets are not that often used, except for the cache. However
applets are mostly complementary and not alternatives anymore, as
it's very possible not to have a back connection or to share one with
other streams.
The connections will soon lose their addresses and their size will
shrink so much that appctx won't fit anymore. Given that the old
benefits of sharing these pools have long disappeared, let's stop
doing this and have a dedicated pool for appctx.
Move the logic to decide if we redispatch to a new server from
sess_update_st_cer() to a new inline function, stream_choose_redispatch(), and
use it in do_l7_retry() instead of just setting the state to SI_ST_REQ.
That way, when using L7 retries, we won't redispatch the request to another
server except if "option redispatch" is used.
This should be backported to 2.0.
Sometimes we need to delegate some list processing to a function running
on another thread. In this case the list element will simply be queued
into a dedicated self-locked list and the task responsible for this list
will be woken up, calling the associated function which will run over the
list.
This is what work_list does. Such lists will be dedicated to a limited
type of work but will significantly ease such remote handling. A function
is provided to create these per-thread lists, their tasks and to properly
bind each task to a distinct thread, so that the caller only has to store
the resulting pointer to the start of the structure.
These structures should not be abused though as each head will consume
4 pointers per thread, hence 32 bytes per thread or 2 kB for 64 threads.
When we're purging idle connections, there's a race condition, when we're
removing the connection from the idle list, to add it to the list of
connections to free, if the thread owning the connection tries to free it
at the same time.
To fix this, simply add a per-thread lock, that has to be hold before
removing the connection from the idle list, and when, in conn_free(), we're
about to remove the connection from every list. That way, we know for sure
the connection will stay valid while we remove it from the idle list, to add
it to the list of connections to free.
This should happen rarely enough that it shouldn't have any impact on
performances.
This has not been reported yet, but could provoke random segfaults.
This should be backported to 2.0.
The maximum number of idle connections for a server can be configured by setting
the server option "pool-max-conn". But when we try to add a connection in its
idle list, because of a wrong comparison, it may be rejected because there are
already "pool-max-conn - 1" idle connections.
This patch must be backported to 2.0 and 1.9.
While experimenting with potentially improved fairness and latency using
ticket locks on a Ryzen 16-thread/8-core, a very strange situation happened
a lot for some levels of traffic. Around 300k connections per second, no
more connections would be accepted on the multi-threaded listener but all
others would continue to work fine. All attempts to trace showed that the
threads were all in the trylock in the fd cache, or in the spinlock of
fd_update_events(), or in the one of fd_may_recv(). But as indicated this
was not a deadlock since the process continues to work fine.
After quite some investigation it appeared that the issue is caused by a
lack of fairness between the fdcache's trylock and these functions' spin
locks above. In fact, regardless of the success or failure of the fdcache's
attempt at grabbing the lock, the poller was calling fd_update_events()
which locks the FD once for something that can be done with a CAS, and
then calls fd_may_recv() with another lock for something that most often
didn't change. The high contention on these spinlocks leaves no chance to
any other thread to grab the lock using trylock(), and once this happens,
there is no thread left to process incoming connection events nor to stop
polling on the FD, leaving all threads at 100% CPU but partially operational.
This patch addresses the issue by using bit-test-and-set instead of the OR
in fd_may_recv() / fd_may_send() so that nothing is done if the FD was
already configured as expected. It does the same in fd_update_events()
using a CAS to check if the FD's events need to be changed at all or not.
With this patch applied, it became impossible to reproduce the issue, and
now there's no way to saturate all 16 CPUs with the load used for testing,
as no more than 1350-1400 were noticed at 300+kcps vs 1600.
Ideally this patch should go further and try to remove the remaining
incarnations of the fdlock as this seems possible, but it's difficult
enough to be done in a distinct patch that will not have to be backported.
It is possible that workloads involving a high connection rate may slightly
benefit from this patch and observe a slightly lower CPU usage even when
the service doesn't misbehave.
This patch must be backported to 2.0 and 1.9.
In the function stream_int_notify(), when the opposite stream-interface is
blocked because there is no more room into the input buffer, if the flag
CF_WRITE_PARTIAL is set on this buffer, it is unblocked. It is a way to unblock
the reads on the other side because some data was sent.
But it is a problem during the fast-forwarding because only the stream is able
to remove the flag CF_WRITE_PARTIAL. So it is possible to have this flag because
of a previous send while the input buffer of the opposite stream-interface is
now full. In such case, the opposite stream-interface will be woken up for
nothing because its input buffer is full. If the same happens on the opposite
side, we will have a loop consumming all the CPU.
To fix the bug, the opposite side is now only notify if there is some available
room in its input buffer in the function si_cs_send(), so only if some data was
sent.
This patch must be backported to 2.0 and 1.9.
When deciding if we keep an idle connection in the session, check if
the number of connections currently in the session is >= the max allowed,
not >, or we'll keep an extra connection.
This should be backported to 1.9 and 2.0.
Just calling conn_force_unsubscribe() from conn_upgrade_mux_fe() is not
enough, as there may be multiple XPRT involved. Instead, require that
any user of conn_upgrade_mux_fe() unsubscribe itself before calling it.
This should fix upgrading a TCP connection to HTX when using SSL.
This should be backported to 2.0.
The receive limit of an HTX channel must be calculated against the total size of
the HTX message. Otherwise, the buffer may never be seen as full whereas the
receive limit is 0. Indeed, the function channel_htx_full() already takes care
to add a block size to the buffer's reserve (8 bytes). So if the function
channel_htx_recv_limit() also keep a block size free in addition to the buffer's
reserve, it means that at least 2 block size will be kept free but only one will
be taken into account, freezing the stream if the option http-buffer-request is
enabled.
This patch fixes the Github issue #136. It should be backported to 2.0 and
1.9. Thanks jaroslawr (Jarosław Rzeszótko) for his help.
Revert commit fe4abe62c7.
The goal was to make sure for health-checks, we would not get sockets in
TIME_WAIT. To do so, we would not call shutdown() if linger_risk is set.
However that is wrong, and that means shutw would never be forwarded to
the server, and thus we could get connection that are never properly closed.
Instead, to fix the original problem as described here :
https://www.mail-archive.com/haproxy@formilux.org/msg34080.html
Just make sure the checks code call cs_shutr() before calling cs_shutw().
If shutr has been called, conn_sock_shutw() will make no attempt to call
shutdown(), as it knows close() will be called.
We should really review and revamp the shutr/shutw code, as described in
github issue #142.
This should be backported to 1.9 and 2.0.
In commit 86eded6c6 ("CLEANUP: tasks: rename task_remove_from_tasklet_list()
to tasklet_remove_*") which consisted in removing the casts between tasks
and tasklet, I was a bit too fast to believe that we only saw tasklets in
this function since process_runnable_tasks() also uses it with tasks under
a cast. So removing the bookkeeping on task_list_size was not appropriate.
Bah, the joy of casts which hide the real thing...
This patch does two things at once to address this mess once for all:
- it restores the decrement of task_list_size when it's a real task,
but moves it to process_runnable_task() since it's the only place
where it's allowed to call it with a task
- it moves the increment there as well and renames
task_insert_into_tasklet_list() to tasklet_insert_into_tasklet_list()
of obvious consistency reasons.
This way the increment/decrement of task_list_size is made at the only
places where the cast is enforced, so it has less risks to be missed.
The comments on top of these functions were updated to reflect that they
are only supposed to be used with tasklets and that the caller is responsible
for keeping task_list_size up to date if it decides to enforce a task there.
Now we don't have to worry anymore about how these functions work outside
of the scheduler, which is better longterm-wise. Thanks to Christopher for
spotting this mistake.
No backport is needed.
In conn_sock_shutw(), avoid calling shutdown() if linger_risk is set. Not
doing so will result in getting sockets in TIME_WAIT for some time.
This is particularly observable with health checks.
This should be backported to 1.9.
The function really only operates on tasklets, its arguments are always
tasklets cast as tasks to match the function's type, to be cast back to
a struct tasklet. Let's rename it to tasklet_remove_from_tasklet_list(),
take a struct tasklet, and get rid of the undesired task casts.
It's really confusing to call it a task because it's a tasklet and used
in places where tasks and tasklets are used together. Let's rename it
to tasklet to remove this confusion.
In the HTX structure, the field <first> is used to know where to (re)start the
analysis. It may differ from the message's head. It is especially important to
update it to handle 1xx messages, to be sure to restart the analysis on the next
message (another 1xx message or the final one). It is also updated when some
data are forwarded (the headers or part of the body). But this update is an
error and must never be done at the analysis level. It is a bug, because some
sample fetches may be used after the data forwarding (but before the first send
of course). At this stage, if the first block position does not point on the
start-line, most of HTTP sample fetches fail.
So now, when something is forwarding by HTX analyzers, the first block position
is not update anymore.
This issue was reported on Github. See #119. No backport needed.
When channel_full() is called for an HTX stream, we fall back on the HTX
version. This function is called, among other, from tcp_inspect_request(). With
this patch, the inspect delay is respected again.
This patch must be backported to 1.9.
With both I/O and tasks in the same tasklet list, we now have a very
smooth and responsive scheduler, providing a good fairness between I/O
activities. With the lower layers relying on tasklet a lot (I/O wakeup,
subscribe, etc), there may often be a large number of totally autonomous
tasklets doing their business such as forwarding data between two muxes.
But the task scheduler historically refrained from picking tasks from the
priority-ordered run queue to put them into the tasklet list until this
later had less than max_runqueue_depth entries. This was to make sure that
low-latency, high-priority tasks would have an opportunity to be dequeued
before others even if they arrive late. But the counter used for this is
still the tasklet list size, which contains countless I/O events. This
causes an unfairness between unbounded I/Os and bounded tasks, resulting
for example in the CLI responding slower when forwarding 40 Gbps of HTTP
traffic spread over a thousand of connections.
A good solution consists in sticking to the initial intent of
max_runqueue_depth which is to limit the number of tasks in the list
(to maintain fairness between them) and not to limit the number of these
tasks among tasklets. It just turns out that the task_list_size initially
was this task counter and changed over time to be a tasklet list size.
Let's simply refrain from updating it for pure tasklets so that it takes
back its original role of counting real tasks as its name implies. With
this change the CLI becomes instantly responsive under load again.
This patch may possibly be backported to 1.9 though it requires some
careful checks.
Just like we have a synchronous recv() function for the stream interface,
let's have a synchronous send function that we'll be able to call from
different places. For now this only moves the code, nothing more.
We should not update the two directions at once, in fact we should update
the Rx path after recv() and the Tx path after send(). Let's start by
splitting the update function in two for this.
The purpose of making idle-conns switch to SI_ST_CON was to make the
transition detectable and the operation retryable in case of connection
error. Now we have the RDY state for this which is much more suitable
since it indicates a validated connection on which we didn't necessarily
send anything yet. This will still lead to a transition to EST while not
requiring unnatural write polling nor connect timeouts.
The main reason for all the trouble we're facing with stream interface
error or timeout reports during the connection phase is that we currently
can't make the difference between a connection attempt and a validated
connection attempt. It is problematic because we tend to switch early
to SI_ST_EST but can't always do what we want in this state since it's
supposed to be set when we don't need to visit sess_establish() again.
This patch introduces a new state betwen SI_ST_CON and SI_ST_EST, which
is SI_ST_RDY. It indicates that we've verified that the connection is
ready. It's a transient state, like SI_ST_DIS, that cannot persist when
leaving process_stream(). For now it is not set, only verified in various
tests where SI_ST_CON was used or SI_ST_EST depending on the cases.
The stream-int state diagram was minimally updated to reflect the new
state, though it is largely obsolete and would need to be seriously
updated.
The stream interface state checks involving ranges were replaced with
checks on a set of states, already revealing some issues. No issue was
fixed, all was replaced in a one-to-one mapping for easier control. Some
checks involving a strict difference were also replaced with fields to
be clearer. At this stage, the result must be strictly equivalent. A few
tests were also turned to their bit-field equivalent for better readability
or in preparation for upcoming changes.
The test performed in the SPOE filter was swapped so that the closed and
error states are evicted first and that the established vs conn state is
tested second.
At some places we do check for ranges of stream-int states but those
are confusing as states ordering is not well known (e.g. it's not obvious
that CER is between CON and EST). Let's create a bit field from states so
that we can match multiple states at once instead. The new enum si_state_bit
contains SI_SB_* which are state bits instead of state values. The function
si_state_in() indicates if the state in argument is one of those represented
by the bit mask in second argument.
Now that the various handshakes come with their own XPRT, there's no
need for the CONN_FL_SOCK* flags, and the conn_sock_want|stop functions,
so garbage-collect them.
Add a new XPRT that is used when using non-SSL handshakes, such as proxy
protocol or Netscaler, instead of taking care of it in conn_fd_handler().
This XPRT is installed when any of those is used, and it removes itself once
the handshake is done.
This should allow us to remove the distinction between CO_FL_SOCK* and
CO_FL_XPRT*.
In channel_htx_forward() and channel_htx_forward_forever(), if the HTX message
is empty, the underlying buffer may be really empty too. And we have no warranty
the caller will call htx_to_buf() later. And in practice, it is almost never
done. So the channel's buffer must not be altered. Otherwise, the buffer may be
considered as full (data == size) for an empty HTX message and no outgoing data.
This patch must be backported to 1.9.
Make usage of the APIs implemented for dictionaries (dict.c) and their LRU caches (struct dcache)
so that to send/receive server names used for the server by name stickiness. These
names are sent over the network as follows:
- in every case we send the encode length of the data (STD_T_DICT), then
- if the server names is not present in the cache used upon transmission (struct dcache_tx)
we cache it and we the ID of this TX cache entry followed the encode length of the
server name, and finally the sever name itseft (non NULL terminated string).
- if the server name is present, we repead these operations but we only send the TX cache
entry ID.
Upon receipt, the couple of (cache IDs, server name) are stored the LRU cache used
only upon receipt (struct dcache_rx). As the peers protocol is symetrical, the fact
that the server name is present in the received data (resp. or not) denotes if
the entry is absent (resp. or not).
This simple patch only adds definitions to create a new stick-table
data type ID and a new standard type to store information in relation
wich dictionary entries (STD_T_DICT).
This patch adds minimalistic definitions to implement dictionary new data structure
which is an ebtree of ebpt_node structs with strings as keys. Note that this has nothing
to see with real dictionary data structure (maps of keys in association with values).
Have "socks4" and "check-via-socks4" server keyword added.
Implement handshake with SOCKS4 proxy server for tcp stream connection.
See issue #82.
I have the "SOCKS: A protocol for TCP proxy across firewalls" doc found
at "https://www.openssh.com/txt/socks4.protocol". Please reference to it.
[wt: for now connecting to the SOCKS4 proxy over unix sockets is not
supported, and mixing IPv4/IPv6 is discouraged; indeed, the control
layer is unique for a connection and will be used both for connecting
and for target address manipulation. As such it may for example report
incorrect destination addresses in logs if the proxy is reached over
IPv6]
Remove the active_tasks_mask variable, we can deduce if we've work to do
by other means, and it is costly to maintain. Instead, introduce a new
function, thread_has_tasks(), that returns non-zero if there's tasks
scheduled for the thread, zero otherwise.
The functions channel_htx_fwd_payload() and channel_htx_fwd_all() should now be
used to forward, respectively, a part of the HTX payload or all of it. These
functions forward data and update the first block position.
We don't store the start-line position anymore in the HTX message. Instead we
store the first block position to analyze. For now, it is almost the same. But
once all changes will be made on this part, this position will have to be used
by HTX analyzers, and only in the analysis context, to know where the analyse
should start.
When new blocks are added in an HTX message, if the first block position is not
defined, it is set. When the block pointed by it is removed, it is set to the
block following it. -1 remains the value to unset the position. the first block
position is unset when the HTX message is empty. It may also be unset on a
non-empty message, meaning every blocks were already analyzed.
From HTX analyzers point of view, this position is always set during headers
analysis. When they are waiting for a request or a response, if it is unset, it
means the analysis should wait. But once the analysis is started, and as long as
headers are not forwarded, it points to the message start-line.
As mentionned, outside the HTX analysis, no code must rely on the first block
position. So multiplexers and applets must always use the head position to start
a loop on an HTX message.
The function channel_htx_fwd_headers() should now be used by HTX analyzers to
forward all headers of an HTX message, from the start-line to the corresponding
EOH. It takes care to update the star-line position.
When channel_recv_max() is called for an HTX stream, we fall back on the HTX
version. This function is called from si_cs_recv(). This will let us pass the
max amount of bytes to read to HTX multiplexers.
Now, we only return the start-line. If not found, NULL is returned. No lookup is
performed and the HTX message is no more updated. It is now the caller
responsibility to update the position of the start-line to the right value. So
when it is not found, i.e sl_pos is set to -1, it means the last start-line has
been already processed and the next one has not been inserted yet.
It is mandatory to rely on this kind of warranty to store 1xx informational
responses and final reponse in the same HTX message.
It's amazing that the value was still incremented under the date lock,
let's first use an atomic increment for the counter and move it out of
the date lock to reduce contention. These are just counters, we don't
need to take locks if we're not rotating, atomic ops are enough. This
patch does this, and leaves the lock for when the period is over. It's
important to note that some values might be added just before or just
after a rotation but this is not a problem since we don't care if a
value is counted in the previous or next period when it's exactly on
the edge. Great care was taken to ensure that the current counter is
always atomically updated.
Other minor cleanups were performed, such as avoiding to reload the
value from memory after a CAS, or using &~1 instead of two shifts to
remove the lowest bit.
This function dumps a lot of information about a stream into the provided
buffer. It is now used by stream_dump_and_crash() and will be used by the
debugger as well.
The struct mworker_proc is not uniformly freed everywhere, sometimes leading
to leaks of the `id` string (and possibly the other strings).
Introduce a mworker_free_child function instead of duplicating the freeing
logic everywhere to prevent this kind of issues.
This leak was reported in issue #96.
It looks like the leaks have been introduced in commit 9a1ee7ac31,
which is specific to 2.0-dev. Backporting `mworker_free_child` might be
helpful to ease backporting other fixes, though.
Now that we have the guarantee that init calls happen before any other
thread starts, we don't need anymore the workaround installed by commit
1605c7ae6 ("BUG/MEDIUM: threads/mworker: fix a race on startup") and we
can instead rely on a regular per-thread initcall for this function. It
will only be performed on worker thread #0, the other ones and the master
have nothing to do, just like in the original code that was only moved
to the function.
With the thread debugger it becomes visible that we can leave some
wandering pointers for a while in curr_task, which is inappropriate.
This patch addresses this by resetting curr_task to NULL before really
freeing the area. This way it becomes safe even regarding signals.
In conn_xprt_close(), after calling xprt->close(), don't forget to set
conn->xprt_ctx to NULL, or we may attempt to reuse the now-free'd
conn->xprt_ctx if the connection failed and we're retrying it.
It's always a pain to have to stuff lots of #ifdef USE_OPENSSL around
ssl headers, it even results in some of them appearing in a random order
and multiple times just to benefit form an existing ifdef block. Let's
make these headers safe for inclusion when USE_OPENSSL is not defined,
they now perform the test themselves and do nothing if USE_OPENSSL is
not defined. This allows to remove no less than 8 such ifdef blocks
and make include blocks more readable.
Since we're providing a compatibility layer for multiple OpenSSL
implementations and their derivatives, it is important that no C file
directly includes openssl headers but only passes via openssl-compat
instead. As a bonus this also gets rid of redundant complex rules for
inclusion of certain files (engines etc).
LibreSSL causes lots of build issues by pretending to be OpenSSL 2.0.0,
and it requires lots of care for each #if added to cover any specific
OpenSSL features.
This commit addresses the problem by making LibreSSL only advertise the
version it forked from (1.0.1g) and by starting to use tests based on
its real version to enable features instead of working by exclusion.
Most tests on OPENSSL_VERSION_NUMBER have become complex and break all
the time because this number is fake for some derivatives like LibreSSL.
This patch creates a new macro, HA_OPENSSL_VERSION_NUMBER, which will
carry the real openssl version defining the compatibility level, and
this version will be adjusted depending on the variants.
When a sample fetch is encoded, we use its context to set info about the
fragmentation. But if the sample is not found, the function sample_process()
returns NULL. So we me be sure the sample exists before setting its context.
This patch must be backported to 1.9 and 1.8.
This patch implements a new global parameter for the master-worker mode.
When setting the mworker-max-reloads value, a worker receive a SIGTERM
if its number of reloads is greater than this value.
Since previous commit it's not needed anymore to test a task pointer
before calling task_destory() so let's just remove these tests from
the various callers before they become confusing. The function's
arguments were also documented. The same should probably be done
with tasklet_free() which involves a test in roughly half of the
call places.
Commit 3f795f7 ("MEDIUM: tasks: Merge task_delete() and task_free() into
task_destroy().") replaced task_delete() and task_free() with a single
function named task_destroy().
This patch adds a check for struct task* argument in function
task_destroy() to prevent a possible segfault on NULL and also to make
the function safer for use in other cases.
With this patch we add a prefix to stick-table names declared in "peers" sections
concatenating the "peers" section name followed by a '/' character with
the stick-table name. Consequently, "peers" sections have their own
namespace for their stick-tables. Obviously, these stick-table names are not the
ones which should be sent over the network. So these configurations must be
compatible and should make A and B peers communicate with peers protocol:
# haproxy A config, old way stick-table declerations
peers mypeers
peer A ...
peer B ...
backend t1
stick-table type string size 10m store gpc0 peers mypeers
# haproxy B config, new way stick-table declerations
peers mypeers
peer A ...
peer B ...
table t1 type string size store gpc0 10m
This "network" name is stored in ->nid new field of stktable struct. The "local"
stktable-name is still stored in ->id.
This patch adds the support for the "table" line parsing in "peers" sections
to declare stick-table in such sections. This also prevents the user from having
to declare dummy backends sections with a unique stick-table inside.
Even if still supported, this usage will become deprecated.
To do so, the ->table member of proxy struct which is a stktable struct is replaced
by a pointer to a stktable struct allocated at parsing time in src/cfgparse-listen.c
for the dummy stick-table backends and in src/cfgparse.c for "peers" sections.
This has an impact on the code for stick-table sample converters and on the stickiness
rules parsers which first store the name of the dummy before resolving the rules.
This patch replaces proxy_tbl_by_name() calls by stktable_find_by_name() calls
to lookup for stick-tables stored in "stktable_by_name" ebtree at parsing time.
There is only one remaining place where proxy_tbl_by_name() is used: src/hlua.c.
At several places in the code we relied on the fact that ->size member of stick-table
was equal to zero to consider the stick-table was present by not configured,
this do not make sense anymore as ->table member of struct proxyis fow now on a pointer.
These tests are replaced by a test on ->table value itself.
In "peers" section we do not have to temporary store the name of the section the
stick-table are attached to because this name is obviously already known just after
having entered this "peers" section.
About the CLI stick-table I/O handler, the pointer to proxy struct is replaced by
a pointer to a stktable struct.
With this patch we move the code responsible of parsing "stick-table"
lines to implement parse_stick_table() function in src/stick-tabble.c
so that to be able to parse "stick-table" elsewhere than in proxy sections.
We have have also added a conf struct to stktable struct to store the filename
and the line in the file the stick-table has been parsed to help in
diagnosing and displaying any configuration issue.
This implements support for the new API which relies on a call to
setsockopt().
On systems that support it (currently, only Linux >= 4.11), this enables
using TCP fast open when connecting to server.
Please note that you should use the retry-on "conn-failure", "empty-response"
and "response-timeout" keywords, or the request won't be able to be retried
on failure.
Co-authored-by: Olivier Houchard <ohouchard@haproxy.com>
The connect() method had 2 arguments, "data", that tells if there's pending
data to be sent, and "delack" that tells if we have to use a delayed ack
inconditionally, or if the backend is configured with tcp-smart-connect.
Turn that into one argument, "flags".
That way it'll be easier to provide more informations to connect() without
adding extra arguments.
SSL_SESSION_get0_id_context is introduced in LibreSSL-2.7.0
async operations are not supported by LibreSSL
early data is not supported by LibreSSL
packet_length is removed from SSL struct in LibreSSL
When running in HTX mode, if we sent the request, but failed to get the
answer, either because the server just closed its socket, we hit a server
timeout, or we get a 404, 408, 425, 500, 501, 502, 503 or 504 error,
attempt to retry the request, exactly as if we just failed to connect to
the server.
To do so, add a new backend keyword, "retry-on".
It accepts a list of keywords, which can be "none" (never retry),
"conn-failure" (we failed to connect, or to do the SSL handshake),
"empty-response" (the server closed the connection without answering),
"response-timeout" (we timed out while waiting for the server response),
or "404", "408", "425", "500", "501", "502", "503" and "504".
The default is "conn-failure".
When for some reason the session is not the owner of the connection anymore,
make sure we remove CO_FL_SESS_IDLE, even if we're about to call
conn->mux->destroy(), as the destroy may not destroy the connection
immediately if it's still in use.
This should be backported to 1.9.
u
In channel_erase(), don't forget to set output to 0, otherwise the
channel won't seem empty, when it really is, and that could lead to
stream never closing properly.
This should be backported to 1.9.
Port range uses a ring buffer, and unfortunately, when making haproxy
multithreaded, it's been overlooked, and the ring buffer is not thread-safe.
When specifying a source range, 2 or more threads could pick the same
port, and of course only one of them could use the port, the others would
always fail the connection.
To fix this, make it a lock-free ring buffer. This is easier than usual
because we know the ring buffer can never be full.
This should be backported to 1.8 and 1.9.
This patch implements the sampling and load-balancing of log servers configured
with "sample" new keyword implemented by this commit:
'MINOR: log: Add "sample" new keyword to "log" lines'.
As the list of ranges used to sample the log to balance is ordered, we only
have to maintain ->curr_idx member of smp_info struct which is the index of
the sample and check if it belongs or not to the current range to decide if we
must send it to the log server or not.
This simplifies the API and hide the details in the sample. This way, only
string and binary are aware of these info, because other types cannot be
partially encoded.
This patch may be backported to 1.9 and 1.8.
Fragmented arg will do fetch at every encode time, each fetch may get
different result if SMP_F_MAY_CHANGE, for example res.payload, but
the length already encoded in first fragment of the frame, that will
cause SPOA decode failed and waste resources.
This patch must be backported to 1.9 and 1.8.
During 1.9 development (and even a bit after) we've started to face a
significant number of situations where streams were abusively spinning
due to an uncaught error flag or complex conditions that couldn't be
correctly identified. Sometimes streams wake appctx up and conversely
as well. More importantly when this happens the only fix is to restart.
This patch adds a new function to report a serious error, some relevant
info and to crash the process using abort() so that a core dump is
available. The purpose will be for this function to be called in various
situations where the process is unfixable. It will help detect these
issues much earlier during development and may even help fixing test
platforms which are able to automatically restart when such a condition
happens, though this is not the primary purpose.
This patch only provides the function and doesn't use it yet.
We have various functions like conn_get_ctrl_name() to retrieve
some information reported in "show sess" for debugging, which
assume that the connection is valid. This is really not convenient
in code aimed at debugging and is error-prone. Let's add a validity
test first.
Instead of having to manually turn task profiling on/off in the
configuration, by default it will work in "auto" mode, which
automatically turns on on any thread experiencing sustained loop
latencies over one millisecond averaged over the last 1024 samples.
This may happen with configs using lots of regex (thing map_reg for
example, which is the lazy way to convert Apache's rewrite rules but
must not be abused), and such high latencies affect all the process
and the problem is most often intermittent (e.g. hitting a map which
is only used for certain host names).
Thus now by default, with profiling set to "auto", it remains off all
the time until something bad happens. This also helps better focus on
the issues when looking at the logs as well as in "show sess" output.
It automatically turns off when the average loop latency over the last
1024 calls goes below 990 microseconds (which typically takes a while
when in idle).
This patch could be backported to stable versions after a bit more
exposure, as it definitely improves observability and the ability to
quickly spot the culprit. In this case, previous patch ("MINOR:
activity: make the profiling status per thread and not global") must
also be taken.
In order to later support automatic profiling turn on/off, we need to
have it per-thread. We're keeping the global option to know whether to
turn it or on off, but the profiling status is now set per thread. We're
updating the status in activity_count_runtime() which is called before
entering poll(). The reason is that we'll extend this with run time
measurement when deciding to automatically turn it on or off.
Very similarly to previous commit doing the same for streams, we now
measure and report an appctx's call rate. This will help catch applets
which do not consume all their data and/or which do not properly report
that they're waiting for something else. Some of them like peers might
theorically be able to exhibit some occasional peeks when teaching a
full table to a nearby peer (e.g. the new replacement process), but
nothing close to what a bogus service can do so there is no risk of
confusion.
Since the commit 1d2b586cd ("MAJOR: htx: Enable the HTX mode by default for all
proxies"), the HTX is enabled by default for all proxies, HTTP and TCP, but also
CLI and HEALTH proxies. But when the best mux is retrieved, only HTTP and TCP
modes are checked. If the TCP mode is not explicitly set, it is considered as an
HTTP proxy. It is an hidden bug introduced when the option "http-use-htx" was
added. It has no effect until the commit 1d2b586cd. But now, when a stats socket
is created for the master process, the mux h1 is installed on all incoming
connections to the CLI proxy, leading to segfaults because HTX operations are
performed on raw buffers.
So to fix the buf, when a mux is installed, all proxies are considered as TCP
proxies, except HTTP ones. This way, CLI and HEALTH proxies will be handled as
TCP proxies.
This patch must be backported to 1.9 although it has no effect. It is safer to
not keep hidden bugs.
The 'do-resolve' action is an http-request or tcp-request content action
which allows to run DNS resolution at run time in HAProxy.
The name to be resolved can be picked up in the request sent by the
client and the result of the resolution is stored in a variable.
The time the resolution is being performed, the request is on pause.
If the resolution can't provide a suitable result, then the variable
will be empty. It's up to the admin to take decisions based on this
statement (return 503 to prevent loops).
Read carefully the documentation concerning this feature, to ensure your
setup is secure and safe to be used in production.
This patch creates a global counter to track various errors reported by
the action 'do-resolve'.
As by default we add all keepalive connections to the idle pool, if we run
into a pathological case, where all client don't do keepalive, but the server
does, and haproxy is configured to only reuse "safe" connections, we will
soon find ourself having lots of idling, unusable for new sessions, connections,
while we won't have any file descriptors available to create new connections.
To fix this, add 2 new global settings, "pool_low_ratio" and "pool_high_ratio".
pool-low-fd-ratio is the % of fds we're allowed to use (against the maximum
number of fds available to haproxy) before we stop adding connections to the
idle pool, and destroy them instead. The default is 20. pool-high-fd-ratio is
the % of fds we're allowed to use (against the maximum number of fds available
to haproxy) before we start killing idling connection in the event we have to
create a new outgoing connection, and no reuse is possible. The default is 25.
For most of the xprt methods, provide a xprt_ctx. This will be useful later
when we'll want to be able to stack xprts.
The init() method now has to create and provide the said xprt_ctx if needed.
task_delete() was never used without calling task_free() just after, and
task_free() was only used on error pathes to destroy a just-created task,
so merge them into task_destroy(), that will remove the task from the
wait queue, and make sure the task is either destroyed immediately if it's
not in the run queue, or destroyed when it's supposed to run.
The wakeup condition in task_wakeup() is redundant as it is already
validated by the CAS. Better move the __task_wakeup() call there, it
also has the merit of being easier to audit this way. This also reduces
the code size by around 1.8 kB :
$ size haproxy-?
text data bss dec hex filename
2153806 100208 1307676 3561690 3658da haproxy-1
2152094 100208 1307676 3559978 36522a haproxy-2
Commit 0c7a4b6 ("MINOR: tasks: Don't set the TASK_RUNNING flag when
adding in the tasklet list.") revealed a hole in the way tasks may
be freed : they could be removed while in the run queue when the
TASK_QUEUED flag was present but not the TASK_RUNNING one. But it
seems the issue was emphasized by commit cde7902 ("MEDIUM: tasks:
improve fairness between the local and global queues") though the
code it replaces was already affected given how late the TASK_RUNNING
flag was set after removal from the global queue.
At the moment the task is picked from the global run queue, if it
is the last one, the global run queue lock is dropped, and then
the TASK_RUNNING flag was added. In the mean time another thread
might have performed a task_free(), and immediately after, the
TASK_RUNNING flag was re-added to the task, which was then added
to the tasklet list. The unprotected window was extremely faint
but does definitely exist and inconsistent task lists have been
observed a few times during very intensive tests over the last few
days. From this point various options are possible, the task might
have been re-allocated while running, and assigned state 0 and/or
state QUEUED while it was still running, resulting in the tast not
being put back into the tree.
This commit simply makes sure that tests on TASK_RUNNING before removing
the task also cover TASK_QUEUED.
It must be backported to 1.9 along with the previous ones touching
that area.
Now that we have the warranty that a task won't be added in the runqueue
while the TASK_QUEUED or the TASK_RUNNING flag is set, don't bother trying
to lock the task by setting leaf_p to 0x1 while inserting it in the runqueue
or having it in the tasklet_list, as nobody else will attempt to add it.
Make sure we set TASK_QUEUED in every case before adding the task to the
run queue. task_wakeup() now checks if either TASK_QUEUED or TASK_RUNNING
is set, and if neither is set, add TASK_QUEUED and effectively add the task
to the runqueue.
No longer use __task_wakeup() anywhere except in task_wakeup(), always use
task_wakeup() instead.
With the old code, process_runnable_task() may re-add a task in the runqueue
without setting the TASK_QUEUED flag, and there were race conditions that could
lead to a task having the TASK_QUEUED flag but not in the runqueue, thus
being unschedulable.
This should be backported to 1.9.
As for smp_prefetch_http(), there is now a way to successfully perform a
prefetch in HTX, even if the message forwarding already begun. It is used for
the sample fetches "req.proto_http" and "method".
This patch must be backported to 1.9.
All HTTP samples are buggy because the channel tested in the prefetch functions
(HTX and legacy HTTP) is chosen depending on the sample direction and not the
keyword really used. It means the request channel is used if the sample is
called during the request analysis and the response channel is used if it is
called during the response analysis, regardless the sample really called. For
instance, if you use the sample "req.ver" in an http-response rule, the response
channel will be prefeched because it is called during the response analysis,
while the request channel should have been used instead. So some assumptions on
the validity of the sample may be made on the wrong channel. It is the first
bug.
Then the same error is done in some samples themselves. So fetches are performed
on the wrong channel. For instance, the header extraction (req.fhdr, res.fhdr,
req.hdr, res.hdr...). If the sample "req.hdr" is used in an http-response rule,
then the matching is done on the response headers and not the request ones. It
is the second bug.
Finally, the last one but not the least, in some samples, the right channel is
used. But because the prefetch was done on the wrong one, this channel may be in
a undefined state. For instance, using the sample "req.ver" in an http-response
rule leads to a matching on a posibility released buffer.
To fix all these bugs, the right channel is now chosen in sample fetches, before
the prefetch. If the same function is used to fetch requests and responses
elements, then the keyword is used to choose the right one. This channel is then
used by the functions smp_prefetch_htx() and smp_prefetch_http(). Of course, it
is also used by the samples themselves to extract information.
This patch must be backported to all supported versions. For version 1.8 and
priors, it must be totally refactored. First because there is no HTX into these
versions. Then the buffers API has changed in HAProxy 1.9. The files
http_fetch.{ch} doesn't exist on old versions.
Maksim Kupriianov reported very strange crashes in fwrr_update_position()
which didn't make sense because of an apparent divide overflow except that
the value was not null in the core.
It happens that while the locking is correct in all the functions' call
graph, the uppermost one (fwrr_get_next_server()) incorrectly expected
that its target server was already locked when called. This stupid
assumption causd the server lock not to be held when calling the other
ones, explaining how it was possible to change the server's eweight by
calling srv_lb_commit_status() under the server lock yet collide with
its unprotected usage.
This commit makes sure that fwrr_get_server_from_group() retrieves a
locked server and that fwrr_get_next_server() is responsible for
unlocking the server before returning it. There is one subtlety in
this function which is that it builds a list of avoided servers that
were full while scanning the tree, and all of them are queued in a
full state so they must be unlocked upon return.
Many thanks to Maksim for providing detailed info allowing to narrow
down this bug.
This fix must be backported to 1.9. In 1.8 the lock seems much wider
and changes to the server's state are performed under the rendez-vous
point so this it doesn't seem possible that it happens there.
Older compilers don't like to see "inline" placed after the type in a
function declaration, it must be "static inline <type>" only. This
patch touches various areas. The warnings were seen with gcc-3.4.
This one hasn't been used anymore since the scheduler changes after 1.8
but it kept being exported and maintained up to date while it's always
reset when scanning the trees. Let's stop exporting it and updating it.
This function will handle mux upgrades, for frontend connections only. It will
retrieve the best mux in the same way than conn_install_mux_fe except that the
mode and optionnally the proto are forced.
The new multiplexer is initialized using a new context and a specific input
buffer. Then, the old one is destroyed. If an error occurred, everything is
rolled back.
It is mandatory to handle mux upgrades, because during a mux upgrade, the
connection will be reassigned to another multiplexer. So when the old one is
destroyed, it does not own the connection anymore. Or in other words, conn->ctx
does not point to the old mux's context when its destroy() callback is
called. So we now rely on the multiplexer context do destroy it instead of the
connection.
In addition, h1_release() and h2_release() have also been updated in the same
way.
The mux's callback init() now take a pointer to a buffer as extra argument. It
must be used by the multiplexer as its input buffer. This buffer is always NULL
when a multiplexer is initialized with a fresh connection. But if a mux upgrade
is performed, it may be filled with existing data. Note that, for now, mux
upgrades are not supported. But this commit is mandatory to do so.
Because the option http-tunnel is now ignored in HTX, there is no longer any
need to adjust the transaction mode in HTX analyzers. A channel can still be
switch to the tunnel mode for legitimate cases (HTTP CONNECT or switching
protocols). So the function htx_adjust_conn_mode() is now useless.
This patch must be backported to 1.9. It is not strictly speaking required but
it will ease futur backports.
This patch implements the external binary support in the master worker.
To configure an external process, you need to use the program section,
for example:
program dataplane-api
command ./dataplane_api
Those processes are launched at the same time as the workers.
During a reload of HAProxy, those processes are dealing with the same
sequence as a worker:
- the master is re-executed
- the master sends a USR1 signal to the program
- the master launches a new instance of the program
During a stop, or restart, a SIGTERM is sent to the program.
The children variable is still used in haproxy, it is not required
anymore since we have the information about the current workers in the
mworker_proc linked list.
The oldpids array is also replaced by this linked list when we
generated the arguments for the master reexec.
I found on an (old) AIX 5.1 machine that stdint.h didn't exist while
inttypes.h which is expected to include it does exist and provides the
desired functionalities.
As explained here, stdint being just a subset of inttypes for use in
freestanding environments, it's probably always OK to switch to inttypes
instead:
https://pubs.opengroup.org/onlinepubs/009696799/basedefs/stdint.h.html
Also it's even clearer here in the autoconf doc :
https://www.gnu.org/software/autoconf/manual/autoconf-2.61/html_node/Header-Portability.html
"The C99 standard says that inttypes.h includes stdint.h, so there's
no need to include stdint.h separately in a standard environment.
Some implementations have inttypes.h but not stdint.h (e.g., Solaris
7), but we don't know of any implementation that has stdint.h but not
inttypes.h"
Previous commit 3ea351368 ("BUG/MEDIUM: h2: Remove the tasklet from the
task list if unsubscribing.") uncovered an issue which needs to be
addressed in the scheduler's API. The function task_remove_from_task_list()
was initially designed to remove a task from the running tasklet list from
within the scheduler, and had to be used in h2 to abort pending I/O events.
However this function was not designed to be idempotent, occasionally
causing a double removal from the tasklet list, with the second doing
nothing but affecting the apparent tasks count and making haproxy use
100% CPU on some tests consisting in stopping the client during some
transfers. The h2_unsubscribe() function can sometimes be called upon
stream exit after an error where the tasklet was possibly already
removed, so it.
This patch does 2 things :
- it renames task_remove_from_task_list() to
__task_remove_from_tasklet_list() to discourage users from calling
it. Also note the fix in the naming since it's a tasklet list and
not a task list. This function is still uesd from the scheduler.
- it adds a new, idempotent, task_remove_from_tasklet_list() function
which does nothing if the task is already not in the tasklet list.
This patch will need to be backported where the commit above is backported.
It's never easy to guess what services are built in. We currently have
the prometheus exporter in contrib/ which is the only extension for now.
Let's enumerate all available ones just like we do for filterr and pollers.
The status codes definition (STAT_STATUS_*) and their string representation
stat_status_codes) have been moved in stats files. There is no reason to keep
them in proto_http files.
In task_unlink_rq, to decide if we should logk the global runqueue lock,
use the TASK_GLOBAL flag instead of relying on t->thread_mask being tid_bit,
as it could be so while still being in the global runqueue if another thread
woke that task for us.
This should be backported to 1.9.
In this function we end up with successive locked operations then a
store barrier, and in addition the compiler has to emit less efficient
code due to a longer jump. There's no need for absolutely updating the
tasks_run_queue counter before clearing the task's leaf pointer, so
let's swap the two operations and benefit from a single barrier as much
as possible. This code is on the hot path and shows about half a percent
of improvement with 8 threads.
By using LIST_DEL_INIT() instead of LIST_DEL()+LIST_INIT() we manage
to bump the peak connection rate by no less than 3% on 8 threads.
The perf top profile shows much less contention in this area which
suffered from the second reload.
We move the code responsible of parsing protocol buffers messages
inside gRPC messages from sample.c to include/proto/protocol_buffers.h
so that to reuse it to cascade "ungrpc" converter.
For now on, "ungrpc" may take a second optional argument to provide
the protocol buffers types used to encode the field value to be extracted.
When absent the field value is extracted as a binary sample which may then
followed by others converters like "hex" which takes binary as input sample.
When this second argument is a type which does not match the one found by "ungrpc",
this field is considered as not found even if present.
With this patch we also remove the useless "varint" and "svarint" converters.
Update the documentation about "ungrpc" converters.
Parsing protocol buffer fields always consists in skip the field
if the field is not found or store the field value if found.
So, with this patch we factorize a little bit the code for "ungrpc" converter.
Historically the default frontend's maxconn used to be quite low (2000),
which was sufficient two decades ago but often proved to be a problem
when users had purposely set the global maxconn value but forgot to set
the frontend's.
There is no point in keeping this arbitrary limit for frontends : when
the global maxconn is lower, it's already too high and when the global
maxconn is much higher, it becomes a limiting factor which causes trouble
in production.
This commit allows the value to be set to zero, which becomes the new
default value, to mean it's not directly limited, or in fact it's set
to the global maxconn. Since this operation used to be performed before
computing a possibly automatic global maxconn based on memory limits,
the calculation of the maxconn value and its propagation to the backends'
fullconn has now moved to a dedicated function, proxy_adjust_all_maxconn(),
which is called once the global maxconn is stabilized.
This comes with two benefits :
1) a configuration missing "maxconn" in the defaults section will not
limit itself to a magically hardcoded value but will scale up to the
global maxconn ;
2) when the global maxconn is not set and memory limits are used instead,
the frontends' maxconn automatically adapts, and the backends' fullconn
as well.
In an attempt to try to provide automatic maxconn settings, we need to
decorrelate a listner's backlog and maxconn so that these values can be
independent. This introduces a listener_backlog() function which retrieves
the backlog value from the listener's backlog, the frontend's, the
listener's maxconn, the frontend's or falls back to 1024. This
corresponds to what was done in cfgparse.c to force a value there except
the last fallback which was not set since the frontend's maxconn is always
known.
Since LIST_DEL_LOCKED() and LIST_POP_LOCKED() now automatically reinitialize
the removed element, there's no need for keeping this LIST_INIT() call in the
idle connection code.
There is one point where we can migrate a connection to another thread
without taking risk, it's when we accept it : the new FD is not yet in
the fd cache and no task was created yet. It's still possible to assign
it a different thread than the one which accepted the connection. The
only requirement for this is to have one accept queue per thread and
their respective processing tasks that have to be woken up each time
an entry is added to the queue.
This is a multiple-producer, single-consumer model. Entries are added
at the queue's tail and the processing task is woken up. The consumer
picks entries at the head and processes them in order. The accept queue
contains the fd, the source address, and the listener. Each entry of
the accept queue was rounded up to 64 bytes (one cache line) to avoid
cache aliasing because tests have shown that otherwise performance
suffers a lot (5%). A test has shown that it's important to have at
least 256 entries for the rings, as at 128 it's still possible to fill
them often at high loads on small thread counts.
The processing task does almost nothing except calling the listener's
accept() function and updating the global session and SSL rate counters
just like listener_accept() does on synchronous calls.
At this point the accept queue is implemented but not used.
In order to quickly pick a thread ID when accepting a connection, we'll
need to know certain pre-computed values derived from the thread mask,
which are counts of bits per position multiples of 1, 2, 4, 8, 16 and
32. In practice it is sufficient to compute only the 4 first ones and
store them in the bind_conf. We update the count every time the
bind_thread value is adjusted.
The fields in the bind_conf struct have been moved around a little bit
to make it easier to group all thread bit values into the same cache
line.
The function used to return a thread number is bind_map_thread_id(),
and it maps a number between 0 and 31/63 to a thread ID between 0 and
31/63, starting from the left.
Instead of having one task per thread and per server that does clean the
idling connections, have only one global task for every servers.
That tasks parses all the servers that currently have idling connections,
and remove half of them, to put them in a per-thread list of connections
to kill. For each thread that does have connections to kill, wake a task
to do so, so that the cleaning will be done in the context of said thread.