There is a small unprotected window for a task between the wait queue
and the run queue where a task could be woken up and destroyed at the
same time. What typically happens is that a timeout is reached at the
same time an I/O completes and wakes it up, and the I/O terminates the
task, causing a use after free in wake_expired_tasks() possibly causing
a crash and/or memory corruption :
thread 1 thread 2
(wake_expired_tasks) (stream_int_notify)
HA_SPIN_UNLOCK(TASK_WQ_LOCK, &wq_lock);
task_wakeup(task, TASK_WOKEN_IO);
...
process_stream()
stream_free()
task_free()
pool_free(task)
task_wakeup(task, TASK_WOKEN_TIMER);
This case is reasonably easy to reproduce with a config using very short
server timeouts (100ms) and client timeouts (10ms), while injecting on
httpterm requesting medium sized objects (5kB) over SSL. All this is
easier done with more threads than allocated CPUs so that pauses can
happen anywhere and last long enough for process_stream() to kill the
task.
This patch inverts the lock and the wakeup(), but requires some changes
in process_runnable_tasks() to ensure we never try to grab the WQ lock
while having the RQ lock held. This means we have to release the RQ lock
before calling task_queue(), so we can't hold the RQ lock during the
loop and must take and drop it.
It seems that a different approach with the scope-aware trees could be
easier, but it would possibly not cover situations where a task is
allowed to run on multiple threads. The current solution covers it and
doesn't seem to have any measurable performance impact.
When a stream is aborted on timeout or any reason initiated by the stream,
and this stream was subscribed to the send list, we forgot to detach it
when freeing it, resulting in a dead node remaining present in the send
list with all usual funny consequences (memory corruption, crashes, etc).
Let's simply unconditionally delete the stream.
When the stats code was moved to an applet, it wasn't completely
cleaned of its usage of the HTTP transaction and it used to store
the HTTP status in txn->status and to set the HTTP request date to
<now> from within the applet. This is totally wrong because the
applet is seen as a server from the HTTP engine, which parses its
response, so the http_txn must not be touched there.
This was made visible by the cache which would always exhibit a
negative TR log, indicating that nowhere in the code we took care of
setting s->logs.tv_request while the code above used to continue to
hide this. Another side effect of this issue is that under load, if
the stats applet call risks to be delayed, the reported t_queue can
appear negative by being below tv_request-tv_accept.
This patch removes the assignment of tv_request and txn->status from
the applet code and instead sets the tv_request if still unset when
connecting to the applet. This ensures that all applets report correct
request timers now.
During high loads it becomes visible that the time drifts between threads,
sometimes showing tens of seconds after several minutes. The root cause is
the per-thread correction which is performed based on a local offset and
local time. But we can't use a unique global time either as we need the
thread-local time to be stable between two poll() calls.
This commit takes a stab at this problem by proceeding this way :
- a global "global_now" date is monotonous and common between all threads.
- each thread has its own local <now> which is resynced with <global_now>
on each invocation of tv_update_date()
- each thread detects its own drift based on its poll() timeout and its
local <now>, and recalculates its adjusted local time
- each thread then ensures its new local time is no older than the current
global time, otherwise it readjusts its local time to match this one
- finally threads do atomically update the global time to match its own
local one
This guarantees a monotonous global time and a monotonous+stable local time.
It is still possible by definition for two threads to report a minor time
variation on subsequent events but that variation will only be caused by
the moment they watched the time and are very small. When a common global
time is needed between all threads, global_now could be used as a reference
(with care). The wallclock time used in logs is still <date> anyway.
With threads, it became mandatory to implement a thread-local time with
its own correction. However, it was noticed that during high thread
contention, the time correction could occasionally be wrong, reporting
huge negative or positive timers in logs. This was caused by the
conversion between struct timeval and a single 64-bit offset, due to
an erroneous shift and due to a loss of sign during the conversion.
Given that time_t is not always signed, and that timeval is not really
needed here, better avoid playing dangerous games with these operations
and use a single 64-bit offset representing a signed 32-bit offset, for
the seconds part and an unsigned offset for the microsecond part.
It still supports atomic updates and doesn't cause issues anymore.
If we can't write early data, for some reason, don't give up on reading them,
they may still be early data to be read, and if we don't do so, openssl
internal states might be inconsistent, and the handshake will fail.
The current code only tries to do the handshake in case we can't send early
data if we're acting as a client, which is wrong, it has to be done on the
server side too, or we end up in an infinite loop.
While using mmap() to allocate pools for debugging purposes, kill -USR1 caused
libc aborts in deinit() on two calls to free() on proxies' tasks and the global
listener task. The issue comes from the fact that we're using free() to release
a task instead of task_free(), so the task was allocated from a pool and released
using a different method.
This bug has been there since at least 1.5, so a backport is desirable to all
maintained versions.
The cache was trying to remove objects from the tree while they were
already removed from it. We set the key to 0 as a check for not trying
to remove the object from the tree when we are still using the object.
The cli command "show cache" displays the status of the cache, the first
displayed line is the shctx informations with how much blocks available
blocks it contains (blocks are 1k by default).
The next lines are the objects stored in the cache tree, the pointer,
the size of the object and how much blocks it uses, a refcount for the
number of users of the object, and the remaining expiration time (which
can be negative if expired)
Example:
$ echo "show cache" | socat - /run/haproxy.sock
0x7fa54e9ab03a: foobar (shctx:0x7fa54e9ab000, available blocks:3921)
0x7fa54ed65b8c (size: 43190 (43 blocks), refcount:2, expire: 2)
0x7fa54ecf1b4c (size: 45238 (45 blocks), refcount:0, expire: 2)
0x7fa54ed70cec (size: 61622 (61 blocks), refcount:0, expire: 2)
0x7fa54ecdbcac (size: 42166 (42 blocks), refcount:1, expire: 2)
0x7fa54ec9736c (size: 44214 (44 blocks), refcount:2, expire: 2)
0x7fa54eca28ec (size: 46262 (46 blocks), refcount:2, expire: -2)
Since we switched to notify mode in the systemd unit file in commit
d6942c8, haproxy won't start if the daemon keyword is present in the
configuration.
This change makes sure that haproxy remains in foreground when using
systemd mode and adds a note in the documentation.
The special case of the Cookie header field was overlooked in the
implementation, considering that most servers do handle cookie lists,
but as reported here on discourse it's not the case at all :
https://discourse.haproxy.org/t/h2-cookie-header-splitted-header/1742
This patch fixes this by skipping all occurences of the Cookie header
in the request while building the H1 request, and then building a single
Cookie header with all values appended at once, according to what is
requested in RFC7540#8.1.2.5.
In order to build the list of values, the list struct is used as a linked
list (as there can't be more cookies than headers). This makes the list
walking quite efficient and ensures all values are quickly found without
having to rescan the list.
A test case provided by Lukas shows that it properly works :
> GET /? HTTP/1.1
> user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
> accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> accept-language: en-US,en;q=0.5
> accept-encoding: gzip, deflate
> referer: https://127.0.0.1:4443/?expectValue=1511294406
> host: 127.0.0.1:4443
< HTTP/1.1 200 OK
< Server: nginx
< Date: Tue, 21 Nov 2017 20:00:13 GMT
< Content-Type: text/html; charset=utf-8
< Transfer-Encoding: chunked
< Connection: keep-alive
< X-Powered-By: PHP/5.3.10-1ubuntu3.26
< Set-Cookie: HAPTESTa=1511294413
< Set-Cookie: HAPTESTb=1511294413
< Set-Cookie: HAPTESTc=1511294413
< Set-Cookie: HAPTESTd=1511294413
< Content-Encoding: gzip
> GET /?expectValue=1511294413 HTTP/1.1
> user-agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Firefox/45.0
> accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> accept-language: en-US,en;q=0.5
> accept-encoding: gzip, deflate
> host: 127.0.0.1:4443
> cookie: SERVERID=s1; HAPTESTa=1511294413; HAPTESTb=1511294413; HAPTESTc=1511294413; HAPTESTd=1511294413
Many thanks to @Nurza, @adrianw and @lukastribus for their helpful reports
and investigations here.
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.
This commit modifies the headers decoding so that it now works in two
phases : hpack_decode_headers() only decodes the HPACK stream in the
HEADERS frame and puts the result into a list. Headers which require
storage (huffman-compressed or from the dynamic table) are stored in
a chunk allocated by the H2 demuxer. Then once the headers are properly
decoded into this list, h2_make_h1_request() is called with this list
to produce the HTTP/1.1 request into the destination buffer. The list
necessarily enforces a limit. Here we use 2*MAX_HTTP_HDR, which means
that we can have as many individual cookies as we have regular headers
if a client decides to break their cookies into multiple values. This
seams reasonable and will allow the H1 parser to decide whether it's
too much or not.
Thus the output stream is not produced on the fly anymore and this will
permit to deal with certain corner cases like reparing the Cookie header
(which for now is not done).
In order to limit header duplication and parsing, the known pseudo headers
continue to be passed by their index : the name element in the list then
has a NULL pointer and the value is the pseudo header's index. Given that
these ones represent about half of the incoming requests and need to be
found quickly, it maintains an acceptable level of performance.
The code was significantly reduced by doing this because the orignal code
had to deal with HPACK and H1 combinations (eg: index vs not indexed, etc)
and now the HPACK decoding is totally focused on the decompression, and
the H1 encoding doesn't have to deal with the issue of wrapping input for
example.
One bug was addressed here (though it couldn't happen at the moment). The
H2 demuxer used to detect a failure to write the request into the H1 buffer
and would then detect if the output buffer wraps, realign it and try again.
The problem by doing so was that the HPACK context was already modified and
not rewindable. Thus the size check is now performed first and a failure is
reported if it doesn't fit.
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.
Here we introduce a function which performs half of what hpack_decode_header()
used to do, which is to take a list of headers on input and emit the
corresponding request in HTTP/1.1 format. The code is the same and functions
were renamed to be prefixed with "h2" instead of "hpack", though it ends
up being simpler as the various HPACK-specific cases could be fused into
a single one (ie: add header).
Moving this part here makes a lot of sense as now this code is specific to
what is documented in HTTP/2 RFC 7540 and will be able to deal with special
cases related to H2 to H1 conversion enumerated in section 8.1.
Various error codes which were previously assigned to HPACK were never
used (aside being negative) and were all replaced by -1 with a comment
indicating what error was detected. The code could be further factored
thanks to this but this commit focuses on compatibility first.
This code is not yet used but builds fine.
We used to return >0 indicating a success when an error was present on the
connection, preventing the caller from detecting and handling it. This for
example happens when sending too many headers in a frame, making the request
impossible to decompress.
Clang reports this warning :
src/server.c:872:14: warning: address of array 'check->desc' will
always evaluate to 'true' [-Wpointer-bool-conversion]
Indeed, check->desc used to be a pointer to a dynamically allocated area
a long time ago and is now an array. Let's remove the useless test.
Clang complains that h2_get_n64() is not used, and a few other protocol
specific functions may fall in that category depending on how the code
evolves. Better mark them unused to silence the warning since it's on
purpose.
Call the shctx free_blocks callback in order to remove the row from the
cache tree.
Put the row in the hot list during allocation, forbid the blocks to be
stolen by a free or a row_reserve
This patch adds support for `Type=notify` to the systemd unit.
Supporting `Type=notify` improves both starting as well as reloading
of the unit, because systemd will be let known when the action completed.
See this quote from `systemd.service(5)`:
> Note however that reloading a daemon by sending a signal (as with the
> example line above) is usually not a good choice, because this is an
> asynchronous operation and hence not suitable to order reloads of
> multiple services against each other. It is strongly recommended to
> set ExecReload= to a command that not only triggers a configuration
> reload of the daemon, but also synchronously waits for it to complete.
By making systemd aware of a reload in progress it is able to wait until
the reload actually succeeded.
This patch introduces both a new `USE_SYSTEMD` build option which controls
including the sd-daemon library as well as a `-Ws` runtime option which
runs haproxy in master-worker mode with systemd support.
When haproxy is running in master-worker mode with systemd support it will
send status messages to systemd using `sd_notify(3)` in the following cases:
- The master process forked off the worker processes (READY=1)
- The master process entered the `mworker_reload()` function (RELOADING=1)
- The master process received the SIGUSR1 or SIGTERM signal (STOPPING=1)
Change the unit file to specify `Type=notify` and replace master-worker
mode (`-W`) with master-worker mode with systemd support (`-Ws`).
Future evolutions of this feature could include making use of the `STATUS`
feature of `sd_notify()` to send information about the number of active
connections to systemd. This would require bidirectional communication
between the master and the workers and thus is left for future work.
Commit 9aaf778 ("MAJOR: connection : Split struct connection into struct
connection and struct conn_stream.") had to change the way the stream
interface deals with incoming data to accomodate the mux. A break
statement got lost during a change, leading to the receive call being
performed twice even when CF_READ_DONTWAIT is set. The most noticeable
effect is that it made the bug described in commit 33982cb ("BUG/MAJOR:
stream: ensure analysers are always called upon close") much easier to
reproduce as it would appear even with an HTTP frontend.
Let's just restore the stream-interface flag and the break here, as in
the previous code.
No backport is needed as this was introduced during 1.8-dev.
A recent issue affecting HTTP/2 + redirect + cache has uncovered an old
problem affecting all existing versions regarding the way events are
reported to analysers.
It happens that when an event is reported, analysers see it and may
decide to temporarily pause processing and prevent other analysers from
processing the same event. Then the event may be cleared and upon the
next call to the analysers, some of them will never see it.
This is exactly what happens with CF_READ_NULL if it is received before
the request is processed, like during redirects : the first time, some
analysers see it, pause, then the event may be converted to a SHUTW and
cleared, and on next call, there's nothing to process. In practice it's
hard to get the CF_READ_NULL flag during the request because requests
have CF_READ_DONTWAIT, preventing the read0 from happening. But on
HTTP/2 it's presented along with any incoming request. Also on a TCP
frontend the flag is not set and it's possible to read the NULL before
the request is parsed.
This causes a problem when filters are present because flt_end_analyse
needs to be called to release allocated resources and remove the
CF_FLT_ANALYZE flag. And the loss of this event prevents the analyser
from being called and from removing itself, preventing the connection
from ever ending.
This problem just shows that the event processing needs a serious revamp
after 1.8. In the mean time we can deal with the really problematic case
which is that we *want* to call analysers if CF_SHUTW is set on any side
ad it's the last opportunity to terminate a processing. It may
occasionally result in some analysers being called for nothing in half-
closed situations but it will take care of the issue.
An example of problematic configuration triggering the bug in 1.7 is :
frontend tcp
bind :4445
default_backend http
backend http
redirect location /
compression algo identity
Then submitting requests which immediately close will have for effect
to accumulate streams which will never be freed :
$ printf "GET / HTTP/1.1\r\n\r\n" >/dev/tcp/0/4445
This fix must be backported to 1.7 as well as any version where commit
c0c672a ("BUG/MINOR: http: Fix conditions to clean up a txn and to
handle the next request") was backported. This commit didn't cause the
bug but made it much more likely to happen.
Upon stream instanciation, we used to enable channel auto connect
and auto close to ease TCP processing. But commit 9aaf778 ("MAJOR:
connection : Split struct connection into struct connection and
struct conn_stream.") has revealed that it was a bad idea because
this commit enables reading of the trailing shutdown that may follow
a small requests, resulting in a read and a shutr turned into shutw
before the stream even has a chance to apply the filters. This
causes an issue with impossible situations where the backend stream
interface is still in SI_ST_INI with a closed output, which blocks
some streams for example when performing a redirect with filters
enabled.
Let's change this so that we only enable these two flags if there is
no analyser on the stream. This way process_stream() has a chance to
let the analysers decide whether or not to allow the shutdown event
to be transferred to the other side.
It doesn't seem possible to trigger this issue before 1.8, so for now
it is preferable not to backport this fix.
A customer reported a crash when within the HTTP request some headers
were not set leading to the module to crash. So the module ignore them
since empty data have no value for the detection.
Needs to be backported to 1.7.
Instead of storing the SSL_SESSION pointer directly in the struct server,
store the ASN1 representation, otherwise, session resumption is broken with
TLS 1.3, when multiple outgoing connections want to use the same session.
Now, we process at most 200 active applets per call to applet_run_active. We use
the same limit as the tasks. With the cache filter and the SPOE, the number of
active applets can now be huge. So, it is important to limit the number of
applets processed in applet_run_active.
applets_active_queue is the active queue size. It is a global variable. So it is
underoptimized because we may be lead to consider there are active applets for a
thread while in fact all active applets are assigned to the otherthreads. So, in
such cases, the polling loop will be evaluated many more times than necessary.
Instead, we now check if the thread id is set in the bitfield active_applets_mask.
This is specific to threads, no backport is needed.
a bitfield has been added to know if there are runnable applets for a
thread. When an applet is woken up, the bits corresponding to its thread_mask
are set. When all active applets for a thread is get to be processed, the thread
is removed from active ones by unsetting its tid_bit from the bitfield.
tasks_run_queue is the run queue size. It is a global variable. So it is
underoptimized because we may be lead to consider there are active tasks for a
thread while in fact all active tasks are assigned to the other threads. So, in
such cases, the polling loop will be evaluated many more times than necessary.
Instead, we now check if the thread id is set in the bitfield active_tasks_mask.
Another change has been made in process_runnable_tasks. Now, we always limit the
number of tasks processed to 200.
This is specific to threads, no backport is needed.
a bitfield has been added to know if there are runnable tasks for a thread. When
a task is woken up, the bits corresponding to its thread_mask are set. When all
tasks for a thread have been evaluated without any wakeup, the thread is removed
from active ones by unsetting its tid_bit from the bitfield.
Since the commit cd7879adc ("BUG/MEDIUM: threads: Run the poll loop on the main
thread too"), the log buffers are allocated after the proxies startup. So log
messages produced during this startup was ignored.
To fix the bug, we restore the initialization of these buffers before proxies
startup.
This is specific to threads, no backport is needed.
At the end of the master initialisation, a call to protocol_unbind_all()
was made, in order to close all the FDs.
Unfortunately, this function closes the inherited FDs (fd@), upon reload
the master wasn't able to reload a configuration with those FDs.
The create_listeners() function now store a flag to specify if the fd
was inherited or not.
Replace the protocol_unbind_all() by mworker_cleanlisteners() +
deinit_pollers()
Does not use the deinit() function during a reload, it's dangerous and
might be subject to double free, segfault and hazardous behavior if
it's called twice in the case of a execvp fail.
After execvp fails, the signals were ignored, preventing to try a reload
again. It is now fixed by reaching the top of the mworker_wait()
function once the execvp failed.
When the master worker fail the execvp, it returns the wrong error
"Cannot allocate memory".
We now display the accurate error corresponding to the errno value.
Now we can show in dotted red the node being removed or surrounded in red
a node having been inserted, and add a description on the graph related to
the operation in progress for example.
Use a smaller and cleaner fixed font, use upper case to indicate sides on
branches, remove the useless node/leaf markers on branches since the colors
already indicate them, and show the node's key as it helps spot the matching
leaf.
Disable the cache if the append of data failed, it should never happen
because the allocated row size is at least equal to the size of the
object to allocate.
Forward the remaining headers with the data in the first call of
cache_store_http_forward_data().
Previously the headers were forwarded first, and the function left,
implying an additionnal call to cache_store_http_forward_data() for the
data.
Cc: Christopher Faulet <cfaulet@haproxy.com>
Use msg->sov to forward headers instead of msg->eoh. It can causes some
problem because eoh does not contains the last \r\n, and the filter does
not support to send the headers partially.
Cc: Christopher Faulet <cfaulet@haproxy.com>
If haproxy is started using the name of the binary only (i.e.
not using a relative or absolute path) the `execv` in
`mworker_reload` fails with `ENOENT`, because it does not
examine the `PATH`:
[WARNING] 315/161139 (7) : Reexecuting Master process
[WARNING] 315/161139 (7) : Cannot allocate memory
[WARNING] 315/161139 (7) : Failed to reexecute the master processs [7]
The error messages are misleading, because the return value of
`execv` is not checked. This should be fixed in a separate commit.
Once this happened the master process ignores any further
signals sent by the administrator.
Replace `execv` with `execvp` to establish the expected
behaviour.
This bug was introduced in commit 73b85e75b3.
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
In spoe_acquire_buffer and spoe_release_buffer, instead of checking the buffer
against buf_empty, we now check its size. It is important because when an
allocation fails, it will be set to buf_wanted. In both cases, the size is 0.
It is a proactive bug fix, no real problem was observed till now. It cannot be
backported as is in 1.7 because of all changes made on the SPOE in 1.8.
Marcus Rckert reported that commit d8b3b65 ("BUG/MEDIUM: splice/threads:
pipe reuse list was not protected.") broke threadless support. Add the
required #ifdef.
In the case of Transfer-Encoding: chunked, there is no Content-Length
which causes the cache to allocate a too small shctx row for the data.
It's not possible to allocate a shctx row for the chunks, we need to be
able to allocate on-the-fly the shctx blocks during the data transfer.
This method was reserved for the HTTP/2 connection preface, must never
be used and must be rejected. In normal situations it doesn't happen,
but it may be visible if a TCP frontend has alpn "h2" enabled, and
forwards to an HTTP backend which tries to parse the request. Before
this patch it would pass the wrong request to the backend server, now
it properly returns 400 bad req.
This patch should probably be backported to stable versions.
At a number of places, bitmasks are used for process affinity and to map
listeners to processes. Every time 1UL<<(relative_pid-1) is used. Let's
create a "pid_bit" variable corresponding to this value to clean this up.
It happens that no single analyser has ever needed to set res.analyse_exp,
so that process_stream() didn't consider it when computing the next task
expiration date. Since Lua actions were introduced in 1.6, this can be
needed on http-response actions for example, so let's ensure it's properly
handled.
Thanks to Nick Dimov for reporting this bug. The fix needs to be
backported to 1.7 and 1.6.
This is useful to know what thread(s) an fd is scheduled to be
handled on. It's worth noting that at the moment the "show fd"d
doesn't seem totally thread-safe.
In commit 53a4766 ("MEDIUM: connection: start to introduce a mux layer
between xprt and data") we introduced a release() function which ends
up never being used. Let's get rid of it now.
When a stream_interface performs a shutw() then a shutr(), the stream
is marked closed. Then cs_destroy() calls h2_detach() and it cannot
fail since we're on the leaving path of the caller. The problem is that
in order to close streams we usually have to send either an emty DATA
frame with the ES flag set or an RST_STREAM frame, and the mux buffer
might already be full, forcing the stream to be queued. The forced
removal of this stream causes this last message to silently disappear,
and the client to wait forever for a response.
This commit ensures we can detach the conn_stream from the h2 stream
if the stream is blocked, effectively making the h2 stream an orphan,
ensures that the mux can deal with orphaned streams after processing
them, and that the demux can kill them upon receipt of GOAWAY.
There is an issue with how the RST_STREAM frames are sent. Some of
them are sent from the demux, either for valid or for closed streams,
and some are sent from the mux always for valid streams. At the moment
the demux stream ID is used, which is wrong for all streams being muxed,
and sometimes results in certain bad HTTP responses causing the emission
of an RST_STREAM referencing stream zero. In addition, the stream's
blocked flags could be updated even if the stream was the closed or
idle ones.
We really need to split the function for the two distinct use cases where
one is used to send an RST on a condition detected at the connection level
(such as a closed stream) and the other one is used to send an RST for a
condition detected at the stream level. The first one is used only in the
demux, and the other one only by a valid stream.
To be thread safe, the function pattern_exec_match copy data (the pattern and
the inner sample) in thread-local variables. But when the sample is duplicated,
we must check its type and not the pattern one.
This is specific to threads, no backport is needed.
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.
Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.
To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.
This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
If the H1 parser would report a status code length not consisting in
exactly 3 digits, the error case was confused with a lack of buffer
room and was causing the parser to loop infinitely.
The H1 parser used by the H2 gateway was a bit lax and could validate
non-numbers in the status code. Since it computes the code on the fly
it's problematic, as "30:" is read as status code 310. Let's properly
check that it's a number now. No backport needed.
The build breaks on a machine without openssl/crypto.h because shctx
still loads openssl-compat.h while it doesn't need it anymore since
the code was moved :
In file included from src/shctx.c:20:0:
include/proto/openssl-compat.h:3:28: fatal error: openssl/crypto.h: No such file or directory
#include <openssl/crypto.h>
Just remove include openssl-compat from shctx.
Commit 522eea7 ("MINOR: ssl: Handle sending early data to server.") added
a dependency on SRV_SSL_O_EARLY_DATA which only exists when USE_OPENSSL
is defined (which is probably not the best solution) and breaks the build
when ssl is not enabled. Just add an ifdef USE_OPENSSL around the block
for now.
This adds a new keyword on the "server" line, "allow-0rtt", if set, we'll try
to send early data to the server, as long as the client sent early data, as
in case the server rejects the early data, we no longer have them, and can't
resend them, so the only option we have is to send back a 425, and we need
to be sure the client knows how to interpret it correctly.
With TLS 1.3, session aren't established until after the main handshake
has completed. So we can't just rely on calling SSL_get1_session(). Instead,
we now register a callback for the "new session" event. This should work for
previous versions of TLS as well.
My recent change in commit ce4e0aa ("MEDIUM: task: change the construction
of the loop in process_runnable_tasks()") was bogus as it used to keep the
rq_next across an unlock/lock sequence, occasionally leading to crashes for
tasks that are eligible to any thread. We must use the lookup call for each
new batch instead. The problem is easily triggered with such a configuration :
global
nbthread 4
listen check
mode http
bind 0.0.0.0:8080
redirect location /
option httpchk GET /
server s1 127.0.0.1:8080 check inter 1
server s2 127.0.0.1:8080 check inter 1
Thanks to Olivier for diagnosing this one. No backport is needed.
Commit 4ac4928 ("BUG/MINOR: stream-int: don't set MSG_MORE on SHUTW_NOW
without AUTO_CLOSE") was incomplete. H2 reveals another situation where
the input stream is marked closed with the request and we set MSG_MORE,
causing a delay before the request leaves.
Better avoid setting the flag on the request path for close cases in
general.
As part of the detection for intentional closes, we can kill the
connection if a shutw() happens before the headers. But it can also
happen that an invalid response is not properly parsed, preventing
any headers frame from being sent and making the function believe
it was an abort. Now instead we check if any response was received
from the stream, regardless of the fact that it was properly
converted.
It's pointless to requeue the task when we're closing, so swap the
order of the task_queue() and h2_release(). It also matches what
was written in the comment regarding re-arming the timer.
h2_detach() is called after a stream was closed, and it evaluates if it's
worth closing the connection. The issue there is that the connection is
closed too early in case there's demand for closing after the last stream,
even if some data remain in the mux. Let's change the condition to check
for this.
When the assignment of the connection state was moved into h2c_error(),
3 of them were missed because they were wrong, using H2_SS_ERROR instead.
This resulted in the connection's state being set to H2_CS_ERROR2 in fact,
so the error was not properly sent.
This one was created to maintain the knowledge that a stream was closed
after having sent an RST_STREAM frame but that's not needed anymore and
it confuses certain conditions on the error processing path. It's time
to get rid of it.
The call to xprt->snd_buf() was not conditionned on the presence of
data in the buffer, resulting in snd_buf() returning 0 and never
disabling the polling. It was revealed by the previous bug on error
processing but must properly be handled.
Some stream errors are detected on the MUX path (eg: H1 response
encoding). The ones forgot to emit an RST_STREAM frame, causing the
client to wait and/or to see the connection being immediately closed.
This is now fixed.
This flag was added after the GOAWAY flags were introduced and mistakenly
placed in the connection, but that doesn't make sense as it's specific to
the stream. The main impact is the risk of returning a DATA0+ES frame for
an error instead of an RST_STREAM.
snr_check_ip_callback() may be called with the server lock, so don't attempt
to lock it again, instead, make sure the callers always have the lock before
calling it.
The scheduler is complex and uses local queues to amortize the cost of
locks. But all this comes with a cost that is quite observable with
single-thread workloads.
The purpose of this patch is to reimplement the much simpler scheduler
for the case where threads are not used. The code is very small and
simple. It doesn't impact the multi-threaded performance at all, and
provides a nice 10% performance increase in single-thread by reaching
606kreq/s on the tests that showed 550kreq/s before.
process_runnable_tasks() needs to requeue or wake up tasks after
processing them in batches. By only refilling the existing ones, we
avoid revisiting all the queue. The performance gain is measurable
starting with two threads, where the request rate climbs to 657k/s
compared to 644k.
This patch slightly rearranges the loop to pack the locked code a little
bit, and to try to concentrate accesses to the tree together to benefit
more from the cache.
It also fixes how the loop handles the right margin : now that is guaranteed
that the retrieved nodes are filtered to only match the current thread, we
don't need to rewind every 16 entries. Instead we can rewind each time we
reach the right margin again.
With this change, we now achieve the following performance for 10 H2 conns
each containing 100 streams :
1 thread : 550kreq/s
2 thread : 644kreq/s
3 thread : 598kreq/s
This function is sensitive, let's make it shorter by factoring out the
unlock and leave code. This reduced the function's size by a few tens
of bytes and increased the overall performance by about 1%.
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.
This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.
This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.
Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.
Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :
Before After Gain
test duration : 89.6s 4.73s x19
HTTP requests/s (DEBUG) : 11200 211300 x19
HTTP requests/s (PROD) : 15900 447000 x28
spin_lock time : 85.2s 0.46s /185
time per lookup : 13us 40ns /325
Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.
A test showed that there's no benefit in using this for the wait queue
though.
The first pid in the pidfile is now the parent, it's more convenient for
supervising the processus.
You can now reload haproxy in master-worker mode with convenient command
like: kill -USR2 $(head -1 /tmp/haproxy.pid)
Commit 0493149 ("MINOR: thread: report multi-thread support in haproxy -vv")
added information about thread support in haproxy -vv output but accidently
marked the message as "must_free" while it's a constant. This causes a segv
on the old process on clean exit if threads are enabled. It doesn't affect
the stability during operations however.
unbind_listener() takes the listener lock, which is already held by
enable_listener(). This situation happens when starting with nbproc > 1
with some bind lines limited to a certain process, because in this case
enable_listener() tries to stop unneeded listeners.
This commit introduces __do_unbind_listeners() which must be called with
the lock held, and makes enable_listener() use this one. Given that the
only return code has never been used and that it starts to make the code
more complicated to propagate it before throwing it to the trash, the
function's return type was changed to void.
The spin_unlock() was called just before setting the expiry to
TICK_ETERNITY, so if another thread has the time to perform its
update and set a timeout, this would would clear it.
Commit c3680ec ("MINOR: add severity information to cli feedback messages")
introduced a severity level to CLI messages, but one of them was missed
on "set server addr". No backport is needed.
Given that all spinning loops we've had since 1.8-rc1 were caused by
unbalanced lock/unlock, let's get rid of all return statements in the
locked check functions and only exit via a a single unlock place.
The "set server <srv> check-port" CLI handler forgot to return after
detecting an error on the port number, and still proceeds with the action.
This needs to be backported to 1.7.
Some unlocks were missing, resulting in deadlocks even with a single thread.
We really need to make these functions safer by getting rid of all those
remaining "return" calls and only leave using a goto!
Using peers or stick table we could update an freq_ctr
using a tick value with the first bit set but this
bit is reserved for lock since multithreading support.
Fix bugs due to missing unlock and recursive lock performing
http health check.
The server's lock scope was enlarged to protect all callers
of 'set_server_check_status' and 'chk_report_conn_err'.
This fix also protects tcpcheck against concurrency.
Before introducing the mux layer, tcp_connect() would poll for sending
to detect the connection establishment. It happens that the health
checks have apparently never explicitly enabled this polling and have
been relying on this implicit one.
Now that there's the mux layer, the conn_stream needs to be enabled
for polling as well and since it's not done in the checks, it's never
done and the check's request doesn't leave the machine, as can be
noticed with http checks.
The solution simply consists in going back to the well-known case
where we enable polling after connecting using cs_want_send() if we
have anything but just a plain connect(). The regular data path is
not affected because the stream interface code automatically computes
the polling needs based on buffer contents.
This situation which must not happen does in fact happen when feeding
artificial responses using errorfiles, Lua or an applet. For now it
causes the H1 response parser to loop forever trying to get a more
complete response. Since it cannot progress, let's return an error.
Don't bother testing if len is nonzero, we know it is, as we're in the
"else" part of a if (!len), and testing it confuses clang into thinking
ret may be left uninitialized.
Store object in the cache. The cache use an shctx for storage.
It uses an http-response action to store the headers and a filter to
store the body. The http-response action is used in order to allow
modifications by other actions before caching.
Forbid shctx to read more than expected, it allows you to use a greater
value as a len with shctx_row_data_get(), the size of the destination
buffer for example.
Previous commit ea3928 (MEDIUM: h2: apply a timeout to h2 connections)
was wrong for two reasons. The first one is that if the client timeout
is not set, it's used as zero, preventing connections from establishing.
The second reason is that if the timeout triggers with active streams
(normally it should not since the task is supposed to be disabled), the
task is removed (h2c->task=NULL), and the last quitting stream might
try to dereference it.
Instead of doing this, we simply not register the task if there's no
timeout (it's useless) and we always control its presence in the streams.
Till now there was no way to deal with a dead H2 connection. Now each
connection creates a task that wakes up to kill the connection. Its
timeout is constantly refreshed when there's some activity. In case
the timeout triggers, the best effort attempts are made at sending a
clean GOAWAY message before closing and signaling the streams.
The timeout is automatically disabled when there's an active stream on
the connection, and restarted when the last stream finishes. This way
it should not affect long sessions.
Given that we're processing data produced by haproxy, we know that the
situations where haproxy doesn't return anything are :
- request timeout with option http-ignore-probes : there's no reason to
hit this since we're creating the stream with the request into it ;
- tcp-request content reject : this definitely means we want to kill the
connection and abort keep-alive and any further processing ;
- using /dev/null as the error file to hide an error
In practice it appears that using the abort on empty response as a hint to
trigger a connection close is very appropriate to continue to give the
control over the connection management. This patch thus tries to send a
GOAWAY frame with the max_id presented as the last stream ID, then sends
an RST_STREAM for the current stream. For the client, this means that the
connection must be shut down immediately after processing the last pending
streams and that the current stream is aborted. This way it's still possible
to force connections to be closed using tcp-request rules.
After some long brainstorming sessions, it appears that "Connection: close"
seems to be the best signal from the L7 layer to indicate the need to close
the connection. Indeed, in H1 it is only present in very rare cases (eg:
certain unrecoverable errors, some of which could remove it now by the way).
It will also be added when the L7 layer wants to force the connection to
terminate. By default when running in keep-alive mode it is not present.
It's worth mentionning that in H1 with persistent connections, we have sort
of a concurrency-1 mux and this header field is used the same way.
Thus here this patch detects "Connection: close" in response headers and
if seen, sends a GOAWAY frame with the highest possible ID so that the
client knows that it can quit whenever it wants to. If more aggressive
closures are needed in the future, we may decide to advertise the max_id
to abort after the current requests and better honor "http-request deny".
For a graceful shutdown, the specs requries to discard frames with a
stream ID higher than the advertised last_id. (RFC7540#6.8). Well,
finally for now the code is disabled (see last page of #6.8). Some
frames need to be processed anyway to maintain the compression state
and the flow control window state, but we don't have any trivial way
to do this and ignore them at the same time. For the headers it's
the worst case where we can't parse headers frames without coming
from the streams, and we don't want to create such streams as we'd
have to abort them, and aborting would cause errors to flow back.
Possibly that a longterm solution might involve using some dummy
streams and dummy buffers for this and calling the parsers directly.
RFC7540#5.1 is pretty clear : "any frame other than WINDOW_UPDATE,
PRIORITY, or RST_STREAM in this state MUST be treated as a connection
error of type STREAM_CLOSED". Instead of dealing with this for each
and every frame type, let's do it once for all in the main demux loop.
RFC7540#5.1 is pretty clear : "any frame other than HEADERS or PRIORITY
in this state MUST be treated as a connection error". Instead of dealing
with this for each and every frame type, let's do it once for all in the
main demux loop.
The ID is respected, and only IDs greater than the advertised last_id
are woken up, with a CS_FL_ERROR flag to signal that the stream is
aborted. This is necessary for a browser to abort a download or to
reject a bad response that affects the connection's state.
Let's replace h2_wake_all_streams() with h2_wake_some_streams(), to
support signaling only streams by their ID (for GOAWAY frames) and
to pass the flags to add on the conn_stream.
When a stream sends a shutw, we send an empty DATA frame with the ES
flag set, except if no HEADERS were sent, in which case we rather send
RST_STREAM. On shutr(1) to abort a request, an RST_STREAM frame is sent
if the stream is OPEN and the stream is closed. Care is taken to switch
the stream's state accordingly and to avoid sending an ES bit again or
another RST once already done.
Data frames are received and transmitted. The per-connection and
per-stream amount of data to ACK is automatically updated. Each
DATA frame is ACKed because usually the downstream link is large
and the upstream one is small, so it seems better to waste a few
bytes every few kilobytes to maintain a low ACK latency and help
the sender keep the link busy. The connection's ACK however is
sent at the end of the demux loop and at the beginning of the mux
loop so that a single aggregated one is emitted (connection
windows tend to be much larger than stream windows).
A future improvement would consist in sending a single ACK for
multiple subsequent DATA frames of the same stream (possibly
interleaved with window updates frames), but this is much trickier
as it also requires to remember the ID of the stream for which
DATA frames have to be sent.
Ideally in the near future we should chunk-encode the body sent
to HTTP/1 when there's no content length and when the request is
not a CONNECT. It's just uncertain whether it's the best option
or not for now.
When it is detected that the number of received bytes is > 0 on the
connection at the end of the demux call or before starting to process
pending output data, an attempt is made at sending a WINDOW UPDATE on
the connection. In case of failure, it's attempted later.
For now we don't build a HEADERS frame with them, but at least we remove
them from the response so that the L7 chunk parser inside isn't blocked
on these (often two) remaining bytes that don't want to leave the buffer.
It also ensures that trailers delivered progressively will correctly be
skipped.
The H1 response data are processed (either following content-length or
chunks) and emitted as H2 DATA frames. In the case of content-length,
the maximum size permitted by the mux buffer, the max frame size, the
connection's window and the stream's window it used to determine the
frame size. For chunked encoding, the same limitation applies, but in
addition, each chunk leads to a distinct frame. This could be improved
in the future to aggregate chunks into larger frames.
Streams blocked on the connection's flow control subscribe to the
connection's fctl_list to be woken up when the window opens again.
Streams blocked on their own flow control don't subscribe to anything,
they just sit waiting for window update frames to reopen the window.
The connection-close mode (without content-length) partially works thanks
to the fact that the SHUTW event leads to a close of the stream. In
practice an empty DATA frame should be sent in this case though.
This calls the h1 response parser and feeds the output through the hpack
encoder to produce stateless HPACK bytecode into an output chunk. For now
it's a bit naive but reasonably efficient.
The HPACK encoder relies on hpack_encode_header() so that the most common
response header fields are encoded based on the static header table. The
forbidden header field names (connection, proxy-connection, upgrade,
transfer-encoding, keep-alive) are dropped before calling the hpack
encoder.
A new flag (H2_CF_HEADERS_SENT) is set once such a frame is emitted. It
will be used to know if we can send an empty DATA+ES frame to use as a
shutdown() signal or if we have to use RST_STREAM.
The trash is already used by the hpack layer and for Huffman decoding,
it's unsafe to use here as a buffer and results in corrupted data. Use
a safely allocated trash instead.
This takes care of creating a new h2s and a new conn_stream when a
HEADERS frame arrives. The recv() callback from the data layer is then
called to extract the frame into the stream's buffer. It is verified
that the stream ID is strictly greater than the known max stream ID.
And the last_id is updated if the current request is properly converted.
The streams are created in open or half-closed(remote) states.
For now there are some limitations :
- frames without END_HEADERS are rejected (CONTINUATION not supported
yet, will require some more changes so that the stream processor
checks the H2 frame header by itself and steals the frames from the
connection)
- padding/stream_dep/priority are currently ignored
- limited error handling, could be improved
But at least the request is properly decoded, transcoded and processed.
If a stream is killed for whatever reason and it happens to be the one
currently blocking the connection, we must unblock the connection and
enable polling again so that it can attempt to make progress. This may
happen for example on upload timeout, where the demux is blocked due to
a full stream buffer, and the stream dies on server timeout and quits.
This does the very minimum required to release a stream and/or a connection
upon the stream's request. The only thing is that it doesn't kill the
connection unless it's already closed or in error or the stream ID reached
the one specified in GOAWAY frame. We're supposed to arm a timer to close
after some idle timeout but it's not done.
For now we have nowhere to store partial header frames so we can't
handle CONTINUATION frames and we must reject them. In this case we
respond with a stream error of type INTERNAL_ERROR.
This one sends an RST_STREAM for a given stream, using the current
demux stream ID. It's also used to send RST_STREAM for streams which
have lost their CS part (ie were aborted).
Now they really increase the window size of connections and streams.
If a stream was not queued but requested to send, it means it was
flow-controlled so it's added again into the connection's send list.
The INITIAL_WINDOW_SIZE and MAX_FRAME_SIZE settings are now extracted
from the settings frame, assigned to the connection, and attempted to
be propagated to all existing streams as per the specification. In
practice clients rarely update the settings after sending the first
stream, so the propagation will rarely be used. The ACK is properly
sent after the frame is completely parsed.
The function h2_process_demux() now tries to parse the incoming bytes
to process as many streams as possible. For now it does nothing but
dropping all incoming frames.
Instead of doing a special processing of the first SETTINGS frame, we
simply parse its header, check that it matches the expected frame type
and flags (ie no ACK), and switch to FRAME_P to parse it as any regular
frame. The regular frame parser will take care of decoding it.
An initial settings frame is emitted upon receipt of the connection
preface, which takes care of configured values. These settings are
only emitted when they differ from the protocol's default value :
- header_table_size (defaults to 4096)
- initial_window_size (defaults to 65535)
- max_concurrent_streams (defaults to unlimited)
- max_frame_size (defaults to 16384)
The max frame size is a copy of tune.bufsize. Clients will most often
reject values lower than 16384 and currently there's no trivial way to
check if H2 is going to be used at boot time.
The send() callback calls h2_process_mux() which iterates over the list
of flow controlled streams first, then streams waiting for room in the
send_list. If a stream from the send_list ends up being flow controlled,
it is then moved to the fctl_list. This way we can maintain the most
accurate fairness by ensuring that flows are always processed in order
of arrival except when they're blocked by flow control, in which case
only the other ones may pass in front of them.
It's a bit tricky as we want to remove a stream from the active lists
if it doesn't block (ie it has no reason for staying there).
If the polling update function is called with RD_ENA while H2_CF_DEM_SFULL
indicates the demux had to block on a stream buffer full condition, we can
remove the flag and re-enable polling for receiving because this is the
indication that a consumer stream has made some room in the buffer. Probably
that we should improve this to ensure that h2s->id == h2c->dsi and avoid
trying to receive multiple times in a row for the wrong stream.
A conn_stream indicates its intent to send by setting the WR_ENA flag
and calling mux->update_poll(). There's no synchronous write so the only
way to emit a response from a stream is to proceed this way. The sender
h2s is then queued into the h2c's send_list if it was not yet queued.
Once the connection is ready, it will enter its send() callback to visit
writers, calling their data->send_cb() callback to complete the operation
using mux->snd_buf().
Also we enable polling if the mux contains data and wasn't enabled. This
may happen just after a response has been transmitted using chk_snd().
It likely is incomplete for now and should probably be refined.
The H2 preface is properly detected to switch to the settings state.
It's important to note that for now we don't send out settings frame
so the operation is not complete yet.
For now it's only used to report immediate errors by announcing the
highest known stream-id on the mux's error path. The function may be
used both while processing a stream or directly in relation with the
connection. The wake() callback will automatically ask for send access
if an error is reported. The function should be usable for graceful
shutdowns as well by simply setting h2c->last_sid to the highest
acceptable stream-id (2^31-1) prior to calling the function.
A connection flag (H2_CF_GOAWAY_SENT) is set once the frame was
successfully sent. It will be usable to detect when it's safe to
close the connection.
Another flag (H2_CF_GOAWAY_FAILED) is set in case of unrecoverable
error while trying to send. It will also be used to know when it's safe
to close the connection.
The rcv_buf() callback now calls h2_process_demux() after an recv() call
leaving some data in the buffer, and the snd_buf() callback calls
h2_process_mux() to try to process pending data from streams.
If some streams were blocked on flow control and the connection's
window was recently opened, or if some streams are waiting while
no block flag remains, we immediately want to try to send again.
This can happen if a recv() for a stream wants to send after the
send() loop has already been processed.
During h2_wake(), there are various situations that can lead to the
connection being closed :
- low-level connection error
- read0 received
- fatal error (ERROR2)
- failed to emit a GOAWAY
- empty stream list with max_id >= last_sid
In such cases, all streams are notified and we have to wait for all
streams to leave while doing nothing, or if the last stream is gone,
we can simply terminate the connection.
It's important to do this test there again because an error might arise
while trying to send a pending GOAWAY after the last stream for example,
thus there's possibly no way to get notified of a closing stream.
It happens that an H2 mux is totally unusable once the client has shut,
so we must consider this situation equivalent to the connection error,
and let the possible streams drain their data if needed then stop.
Now we start to set the flags to indicate that the response buffer is
being awaited or that it is full, it makes it possible to centralize a
little bit the polling management into the wake() callback.
In case of error, we wake all the streams up so that they are aware of
the nature of the event and are able to detach if needed.
Flag H2_CF_DEM_DALLOC is set when the demux buffer fails to be allocated
in the recv() callback, and is cleared when it succeeds.
Both flags H2_CF_MUX_MALLOC and H2_CF_DEM_MROOM are cleared when the mux
buffer allocation succeeds.
In both cases it will be up to the callers to report allocation failures.
This one will be used by the HEADERS frame handler and maybe later by
the PUSH frame handler. It creates a conn_stream in the mux's connection.
The create streams are inserted in the h2c's tree sorted by IDs. The
caller is expected to have verified that the stream doesn't exist yet.
It will be more convenient to always manipulate existing streams than
null pointers. Here we create one idle stream and one closed stream.
The idea is that we can easily point any stream to one of these states
in order to merge maintenance operations.
Functions h2_get_buf_n{16,32,64}() and h2_get_buf_bytes() respectively
extract a network-ordered 16/32/64 bit value from a possibly wrapping
buffer, or any arbitrary size. They're convenient to retrieve a PING
payload or to parse SETTINGS frames. Since they copy one byte at a time,
they will be less efficient than a memcpy-based implementation on large
blocks.
This function extracts the next frame header but doesn't consume it.
This will allow to detect a stream-id change and to perform a yielding
window update without losing information. The result is stored into a
temporary frame descriptor. We could also store the next frame header
into the connection but parsing the header again is much cheaper than
wasting bytes in the connection for a rare use case.
A function (h2_skip_frame_hdr()) is also provided to skip the parsed
header (always 9 bytes) and another one (h2_get_frame_hdr()) to do both
at once.
This function is called after preparing a frame, in order to update the
frame's size in the frame header. It takes the frame payload length in
argument.
It simply writes a 24-bit frame size into a buffer, making use of the
net_helper functions which try to optimize per platform (this is a
frequently used operation).
This one will store the error into the stream's errcode if it's neither
idle nor closed (since these ones are read-only) and switch its state to
H2_SS_ERROR. If a conn_stream is attached, it will be flagged with
CS_FL_ERROR.
A mux is busy when any stream id >= 0 is currently being handled
and the current stream's id doesn't match. When no stream is
involved (ie: demuxer), stream 0 is considered. This will be
necessary to know when it's possible to send frames.
A demux may be prevented from receiving for the following reasons :
- no receive buffer could be allocated
- the receive buffer is full
- a response is needed and the mux is currently being used by a stream
- a response is needed and some room could not be found in the mux
buffer (either full or waiting for allocation)
- the stream buffer is waiting for allocation
- the stream buffer is full
A mux may stop accepting data for the following reasons :
- the buffer could not be allocated
- the buffer is full
A stream may stop sending data to a mux for the following reaons :
- the mux is busy processing another stream
- the mux buffer lacks room (full or not allocated)
- the mux's flow control prevents from sending
- the stream's flow control prevents from sending
All these conditions were turned into flags for use by the respective
places.
The idea is that we may need a mux buffer for anything, ranging from
receiving to sending traffic. For now it's unclear where exactly the
calls will be placed so let's block both send and recv when a buffer
is missing, and re-enable both of them at the end. This will have to
be changed later.
This patch implements a very basic Rx buffer management. The mux needs
an rx buffer to decode the connection's stream. If this buffer it
available upon Rx events, we fill it with whatever input data are
available. Otherwise we try to allocate it and subscribe to the buffer
wait queue in case of failure. In such a situation, a function
"h2_dbuf_available()" will be called once a buffer may be allocated.
The buffer is released if it's still empty after recv().
The connection's h2c context is now allocated and initialized on mux
initialization, and released on mux destruction. Note that for now the
release() code is never called.
We need to deal with stream error notifications (RST_STREAM) as well as
internal reporting. The problem is that we don't know in which order
this will be done so we can't unilaterally decide to deallocate the
stream. In order to help, we add two extra stream states, H2_SS_ERROR
and H2_SS_RESET. The former mentions that the stream has an error pending
and the latter indicates that the error was already sent and that the
stream is now closed. It's equivalent to H2_SS_CLOSED except that in this
state we'll avoid sending new RST_STREAM as per RFC7540#5.4.2.
With this it will be possible to only detach or deallocate the h2s once
the stream is closed.
This describes an HTTP/2 stream with its relation to the connection
and to the conn_stream on the other side.
For now we also allocate request and response state for HTTP/1 because
the internal HTTP representation is HTTP/1 at the moment. Later this
should evolve towards a version-agnostic representation and this H1
message state will disappear.
It's important to consider that the streams are necessarily polarized
depending on h2c : if the connection is incoming, streams initiated by
the connection receive requests and send responses. Otherwise it's the
other way around. Such information is known during the connection
instanciation by h2c_frt_init() and will normally be reflected in the
stream ID (odd=demux from client, even=demux from server). The initial
H2_CS_PREFACE state will also depend on the direction. The current h2c
state machine doesn't allow for outgoing connections as it uses a single
state for both (rx state only). It should be the demux state only.
The h2c struct describes an H2 connection context and is assigned as the
mux's context. It has its own pool, allocated at boot time and released
after deinit().
For now it only supports literals and a bit of static header table
references for the 9 most common header field names (date, server,
content-type, content-length, last-modified, accept-ranges, etag,
cache-control, location).
A previous incarnation of this commit used to strip the forbidden H2
header names (connection, proxy-connection, upgrade, transfer-encoding,
keep-alive) but this is no longer the case as this filtering is irrelevant
to HPACK encoding and is specific to H2, so this will have to be done by
the caller.
It's quite not optimal but works fine enough to prepare some valid and
partially compressed responses during development.
The decoder is now fully functional. It makes use of the dynamic header
table. Dynamic header table size updates are currently ignored, as our
initially advertised value is the highest we support. Strictly speaking,
the impact is that a client referencing a header field after such an
update wouldn't observe an error instead of the connection being dropped
if it was implemented.
Decoded header fields are copied into a target buffer in HTTP/1 format
using HTTP/1.1 as the version. The Host header field is automatically
appended if a ":authority" header field is present.
All decoded header fields can be displayed if the file is compiled with
DEBUG_HPACK.
This code deals with header insertion, retrieval and eviction, as well
as with dynamic header table defragmentation. It is functional for use
as a decoder and was heavily tested in this context. There's still some
room for optimization (eg: the defragmentation code currently does it
in place using a memcpy).
Also for now the dynamic header table is allocated using malloc() while
a pool needs to be created instead.
This code was mostly imported from https://github.com/wtarreau/http2-exp
with "hpack_" prepended in front of most names to avoid risks of conflicts.
Some small cleanups and renamings were applied during the import. This
version must be considered more recent.
Some HPACK error codes were placed here (HPACK_ERR_*), not exactly because
they're needed by the decoder but they'll be needed by all callers. Maybe
a different location should be found.
The code was borrowed from the HPACK experimental implementations
available here :
https://github.com/wtarreau/http2-exp
It contains the Huffman table as specified in RFC7541 Appendix B, and a
set of reverse tables used to decode a Huffman byte stream, and produced
by contrib/h2/gen-rht. The encoder is not finalized, it doesn't emit the
byte stream but this is not needed for now.
Now we don't remove the session when a stream dies, instead we
detach the stream and let the mux decide to release the connection
and call session_free() instead.
Since multiple streams can share one session attached to one listener,
the listener_release() call must be done in session_free() and not in
stream_free(), otherwise we end up with a negative count in H2.
This callback will be used to release upper layers when a mux is in
use. Given that the mux can be asynchronously deleted, we need a way
to release the extra information such as the session.
This callback will be called directly by the mux upon releasing
everything and before the connection itself is released, so that
the callee can find its information inside the connection if needed.
The way it currently works is not perfect, and most likely this should
instead become a mux release callback, but for now we have no easy way
to add mux-specific stuff, and since there's one mux per connection,
it works fine this way.
Now that the mux will take care of closing the client connection at the
right moment, we don't need to close the client connection anymore, and
we just need to close the conn_stream.
For H2, only the mux's timeout or other conditions might cause a
release of the mux and the connection, no stream should be allowed
to kill such a shared connection. So a stream will only detach using
cs_destroy() which will call mux->detach() then free the cs.
For now it's only handled by mux_pt. The goal is that the data layer
never has to care about the connection, which will have to be released
depending on the mux's mood.
At all call places where a conn_stream is in use, we can now use
cs_close() to get rid of a conn_stream and of its underlying connection
if the mux estimates it makes sense. This is what is currently being
done for the pass-through mux.
Now these functions are able to automatically close both the transport
and the socket layer, causing the whole connection to be torn down if
needed.
The two shutdown modes are implemented for both directions, and when
a direction is closed, if it sees the other one is closed as well, it
completes by closing the connection. This is similar to what is performed
in the stream interface.
It's not deployed yet but the purpose is to get rid of conn_full_close()
where only conn_stream should be known.
Instead of having to manually handle lingering outside, let's make
conn_sock_shutw() check for it before calling shutdown(). We simply
don't want to emit the FIN if we're going to reset the connection
due to lingering. It's particularly important for silent-drop where
it's absolutely mandatory that no packet leaves the machine.
In a 1:1 connection:stream there's no problem relying on the connection
flags alone to check for errors. But in a mux, it will be possible to mark
certain streams in error without having to mark all of them. An example is
an H2 client sending RST_STREAM frames to abort a long download, or a parse
error requiring to abort only this specific stream.
This commit ensures that stream-interface and checks properly check for
CS_FL_ERROR in cs->flags wherever CO_FL_ERROR was in use. Most likely over
the long term, any check for CO_FL_ERROR will have to disappear.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
This patch introduces a new struct conn_stream. It's the stream-side of
a multiplexed connection. A pool is created and destroyed on exit. For
now the conn_streams are not used at all.
A new sample fetch function reports either 1 or 2 for the on-wire encoding,
to indicate if the request was received using the HTTP/1.x format or HTTP/2
format. Note that it reports the on-wire encoding, not the version presented
in the request header.
This will possibly have to evolve if it becomes necessary to report the
encoding on the server side as well.
When an incoming connection is made on an HTTP mode frontend, the
session now looks up the mux to use based on the ALPN token and the
proxy mode. This will allow easier mux registration, and we don't
need to hard-code the mux_pt_ops anymore.
The pass-through mux is the fallback used on any incoming connection
unless another mux claims the ALPN name and the proxy mode. Thus mux_pt
registers ALPN token "" (empty name) which catches everything.
Selecting a mux based on ALPN and the proxy mode will quickly become a
pain. This commit provides new functions to register/lookup a mux based
on the ALPN string and the proxy mode to make this easier. Given that
we're not supposed to support a wide range of muxes, the lookup should
not have any measurable performance impact.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
commit 6e01286 (BUG/MAJOR: threads/freq_ctr: fix lock on freq counters)
attempted to fix the loop using volatile but that doesn't work depending
on the level of optimization, resulting in situations where the threads
could remain looping forever. Here we use memory barriers between reads
to enforce a strict ordering and the asm code produced does exactly what
the C code does and works perfectly, with a 3-digit measurement accuracy
observed during a test.
This is needed in the H2->H1 gateway so that we know how long the trailers
block is in chunked encoding. It returns the number of bytes, or 0 if some
are missing, or -1 in case of parse error.
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
srv_set_fqdn() may be called with the DNS lock already held, but tries to
lock it anyway. So, add a new parameter to let it know if it was already
locked or not;
In function tv_update_date, we keep an offset reprenting the time deviation to
adjust the system time. At every call, we check if this offset must be updated
or not. Of course, It must be shared by all threads. It was store in a
timeval. But it cannot be atomically updated. So now, instead, we store it in a
64-bits integer. And in tv_update_date, we convert this integer in a
timeval. Once updated, it is converted back in an integer to be atomically
stored.
To store a tv_offset into an integer, we use 32 bits from tv_sec and 32 bits
tv_usec to avoid shift operations.
The wrong bit was set to keep the lock on freq counter update. And the read
functions were re-worked to use volatile.
Moreover, when a freq counter is updated, it is now rotated only if the current
counter is in the past (now.tv_sec > ctr->curr_sec). It is important with
threads because the current time (now) is thread-local. So, rounded to the
second, the time may vary by more or less 1 second. So a freq counter rotated by
one thread may be see 1 second in the future. In this case, it is updated but
not rotated.
There was a flaw in the way the threads was created. the main one was just used
to create all the others and just wait to exit. Now, it is used to run a poll
loop. So we only create nbthread-1 threads.
This also fixes a bug about the compression filter when there is only 1 thread
(nbthread == 1 or no threads support). The bug was in the way thread-local
resources was initialized. per-thread init/deinit callbacks were never called
for the main process. So, with nthread set to 1, some buffers remained
uninitialized.
For now, we don't know if device detection modules (51degrees, deviceatlas and
wurfl) are thread-safe or not. So HAproxy exits with an error when you try to
use one of them with nbthread greater than 1.
We will ask to maintainers of these modules to make them thread-safe or to give
us hints to do so.
Tasks used to process checks are created to be processed by any threads. But,
once a check is started, we must be sure to be sticky on the running thread
because I/O will be also sticky on it. This is a requirement for now: Tasks and
I/O handlers linked to the same session must be executed on the same thread.
By default, no affinity is set for threads. To bind threads on CPU, you must
define a "thread-map" in the global section. The format is the same than the
"cpu-map" parameter, with a small difference. The process number must be
defined, with the same format than cpu-map ("all", "even", "odd" or a number
between 1 and 31/63).
A thread will be bound on the intersection of its mapping and the one of the
process on which it is attached. If the intersection is null, no specific bind
will be set for the thread.
Because there is not migration mechanism yet, all runtime information about an
SPOE agent are thread-local and async exchanges with agents are disabled when we
have serveral threads. Howerver, pipelining is still available. So for now, the
thread part of the SPOE is pretty simple.
We have two y for nsuring that the data is not concurently manipulated:
- locks
- running task on the same thread.
locks are expensives, it is better to avoid it.
This patch cecks that the Lua task run on the same thread that
the stream associated to the coprocess.
TODO: in a next version, the error should be replaced by a yield
and thread migration request.
The applet manipulates the session and its buffers. We have two methods for
ensuring that the memory of the session will not change during its manipulation
by the task:
1 - adding mutex
2 - running on the same threads than the task.
The second point is smart because it cannot lock the execution of another thread.
Note that the Lua processing is not really thread safe. It provides
heavy system which consists to add our own lock function in the Lua
code and recompile the library. This system will probably not accepted
by maintainers of various distribs.
Our main excution point of the Lua is the function lua_resume(). A
quick looking on the Lua sources displays a lua_lock() a the start
of function and a lua_unlock() at the end of the function. So I
conclude that the Lua thread safe mode just perform a mutex around
all execution. So I prefer to do this in the HAProxy code, it will be
easier for distro maintainers.
Note that the HAProxy lua functions rounded by the macro SET_SAFE_LJMP
and RESET_SAFE_LJMP manipulates the Lua stack, so it will be careful
to set mutex around these functions.
The jmpbuf contains pointer on the stack memory address currently use
when the jmpbuf is set. So the information is local to each thread.
The struct field is too big to put it in the stack, but it is used
as buffer for retriving stats values. So, this buffer si local to each
threads. Each function using this buffer, use it whithout break (yield)
so, the consistency of local buffer is ensured.
Now, it is possible to define init_per_thread and deinit_per_thread callbacks to
deal with ressources allocation for each thread.
This is the filter responsibility to deal with concurrency. This is also the
filter responsibility to know if HAProxy is started with some threads. A good
way to do so is to check "global.nbthread" value. If it is greater than 1, then
_per_thread callbacks will be called.
A RW lock has been added to the vars structure to protect each list of
variables. And a global RW lock is used to protect registered names.
When a varibable is fetched, we duplicate sample data because the variable could
be modified by another thread.
When a frequency counter must be updated, we use the curr_sec/curr_tick fields
as a lock, by setting the MSB to 1 in a compare-and-swap to lock and by reseting
it to unlock. And when we need to read it, we loop until the counter is
unlocked. This way, the frequency counters are thread-safe without any external
lock. It is important to avoid increasing the size of many structures (global,
proxy, server, stick_table).
locks have been added in pat_ref and pattern_expr structures to protect all
accesses to an instance of on of them. Moreover, a global lock has been added to
protect the LRU cache used for pattern matching.
Patterns are now duplicated after a successfull matching, to avoid modification
by other threads when the result is used.
Finally, the function reloading a pattern list has been modified to be
thread-safe.
First, OpenSSL is now initialized to be thread-safe. This is done by setting 2
callbacks. The first one is ssl_locking_function. It handles the locks and
unlocks. The second one is ssl_id_function. It returns the current thread
id. During the init step, we create as much as R/W locks as needed, ie the
number returned by CRYPTO_num_locks function.
Next, The reusable SSL session in the server context is now thread-local.
Shctx is now also initialized if HAProxy is started with several threads.
And finally, a global lock has been added to protect the LRU cache used to store
generated certificates. The function ssl_sock_get_generated_cert is now
deprecated because the retrieved certificate can be removed by another threads
in same time. Instead, a new function has been added,
ssl_sock_assign_generated_cert. It must be used to search a certificate in the
cache and set it immediatly if found.
A lock is used to protect accesses to a peer structure.
A the lock is taken in the applet handler when the peer is identified
and released living the applet handler.
In the scheduling task for peers section, the lock is taken for every
listed peer and released at the end of the process task function.
The peer 'force shutdown' function was also re-worked.
A global lock has been added to protect accesses to the list of active
applets. A process mask has also been added on each applet. Like for FDs and
tasks, it is used to know which threads are allowed to process an
applet. Because applets are, most of time, linked to a session, it should be
sticky on the same thread. But in all cases, it is the responsibility of the
applet handler to lock what have to be protected in the applet context.
This is done by passing the right stream's proxy (the frontend or the backend,
depending on the context) to lock the error snapshot used to store the error
info.
The stick table API was slightly reworked:
A global spin lock on stick table was added to perform lookup and
insert in a thread safe way. The handling of refcount on entries
is now handled directly by stick tables functions under protection
of this lock and was removed from the code of callers.
The "stktable_store" function is no more externalized and users should
now use "stktable_set_entry" in any case of insertion. This last one performs
a lookup followed by a store if not found. So the code using "stktable_store"
was re-worked.
Lookup, and set_entry functions automatically increase the refcount
of the returned/stored entry.
The function "sticktable_touch" was renamed "sticktable_touch_local"
and is now able to decrease the refcount if last arg is set to true. It
is allowing to release the entry without taking the lock twice.
A new function "sticktable_touch_remote" is now used to insert
entries coming from remote peers at the right place in the update tree.
The code of peer update was re-worked to use this new function.
This function is also able to decrease the refcount if wanted.
The function "stksess_kill" also handle a parameter to decrease
the refcount on the entry.
A read/write lock is added on each entry to protect the data content
updates of the entry.
A lock for LB parameters has been added inside the proxy structure and atomic
operations have been used to update server variables releated to lb.
The only significant change is about lb_map. Because the servers status are
updated in the sync-point, we can call recalc_server_map function synchronously
in map_set_server_status_up/down function.
This list is used to save changes on the servers state. So when serveral threads
are used, it must be locked. The changes are then applied in the sync-point. To
do so, servers_update_status has be moved in the sync-point. So this is useless
to lock it at this step because the sync-point is a protected area by iteself.
For now, we have a list of each type per thread. So there is no need to lock
them. This is the easiest solution for now, but not the best one because there
is no sharing between threads. An idle connection on a thread will not be able
be used by a stream on another thread. So it could be a good idea to rework this
patch later.
Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
First, we use atomic operations to update jobs/totalconn/actconn variables,
listener's nbconn variable and listener's counters. Then we add a lock on
listeners to protect access to their information. And finally, listener queues
(global and per proxy) are also protected by a lock. Here, because access to
these queues are unusal, we use the same lock for all queues instead of a global
one for the global queue and a lock per proxy for others.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
Many changes have been made to do so. First, the fd_updt array, where all
pending FDs for polling are stored, is now a thread-local array. Then 3 locks
have been added to protect, respectively, the fdtab array, the fd_cache array
and poll information. In addition, a lock for each entry in the fdtab array has
been added to protect all accesses to a specific FD or its information.
For pollers, according to the poller, the way to manage the concurrency is
different. There is a poller loop on each thread. So the set of monitored FDs
may need to be protected. epoll and kqueue are thread-safe per-se, so there few
things to do to protect these pollers. This is not possible with select and
poll, so there is no sharing between the threads. The poller on each thread is
independant from others.
Finally, per-thread init/deinit functions are used for each pollers and for FD
part for manage thread-local ressources.
Now, you must be carefull when a FD is created during the HAProxy startup. All
update on the FD state must be made in the threads context and never before
their creation. This is mandatory because fd_updt array is thread-local and
initialized only for threads. Because there is no pollers for the main one, this
array remains uninitialized in this context. For this reason, listeners are now
enabled in run_thread_poll_loop function, just like the worker pipe.
log buffers and static variables used in log functions are now thread-local. So
there is no need to lock anything to log messages. Moreover, per-thread
init/deinit functions are now used to initialize these buffers.
The function sync_poll_loop is called at the end of each loop inside
run_poll_loop function. It is a protected area where all threads have a chance
to execute tricky tasks with the warranty that no concurrent access is
possible. Of course, it comes with a cost because all threads must be
syncrhonized. So changes must be uncommon.
[WARNING] For now, HAProxy is not thread-safe, so from this commit, it will be
broken for a while, when compiled with threads.
When nbthread parameter is greater than 1, HAProxy will create the corresponding
number of threads. If nbthread is set to 1, nothing should be done. So if there
are concurrency issues (and be sure there will be, unfortunatly), an obvious
workaround is to disable the multithreading...
Each created threads will run a polling loop. So, in a certain way, it is pretty
similar to the nbproc mode ("outside" the bugs and the lock
contention). Nevertheless, there are an init and a deinit steps for each thread
to deal with per-thread allocation.
Each thread has a tid (thread-id), numbered from 0 to (nbtread-1). It is used in
many place to do bitwise operations or to improve debugging information.
A sync-point is a protected area where you have the warranty that no concurrency
access is possible. It is implementated as a thread barrier to enter in the
sync-point and another one to exit from it. Inside the sync-point, all threads
that must do some syncrhonous processing will be called one after the other
while all other threads will wait. All threads will then exit from the
sync-point at the same time.
A sync-point will be evaluated only when necessary because it is a costly
operation. To limit the waiting time of each threads, we must have a mechanism
to wakeup all threads. This is done with a pipe shared by all threads. By
writting in this pipe, we will interrupt all threads blocked on a poller. The
pipe is then flushed before exiting from the sync-point.
hap_register_per_thread_init and hap_register_per_thread_deinit functions has
been added to register functions to do, for each thread, respectively, some
initialization and deinitialization. These functions are added in the global
lists per_thread_init_list and per_thread_deinit_list.
These functions are called only when HAProxy is started with more than 1 thread
(global.nbthread > 1).
This file contains all functions and macros used to deal with concurrency in
HAProxy. It contains all high-level function to do atomic operation
(HA_ATOMIC_*). Note, for now, we rely on "__atomic" GCC builtins to do atomic
operation. So HAProxy can be compiled with the thread support iff these builtins
are available.
It also contains wrappers around plocks to use spin or read/write locks. These
wrappers are used to abstract the internal representation of the locking system
and to add information to help debugging, when compiled with suitable
options.
To add extra info on locks, you need to add DEBUG=-DDEBUG_THREAD or
DEBUG=-DDEBUG_FULL compilation option. In addition to timing info on locks, we
keep info on where a lock was acquired the last time (function name, file and
line). There are also the thread id and a flag to know if it is still locked or
not. This will be useful to debug deadlocks.
Because we can't always display the standard error messages when HAProxy is
started, all alerts and warnings emitted during the startup will now be saved in
a buffer. It can also be handy to store these messages just in case you
missed something during the startup
To implement this feature, Alert and Warning functions now relies on
display_message. The difference is just on conditions to call this function and
it remains unchanged. In display_message, if MODE_STARTING flag is set, we save
the message.
Now memprintf relies on memvprintf. This new function does exactly what
memprintf did before, but it must be called with a va_list instead of a variable
number of arguments. So there is no change for every functions using
memprintf. But it is now also possible to have same functionnality from any
function with variadic arguments.
Email alerts relies on checks to send emails. The link between a mailers section
and a proxy was resolved during the configuration parsing, But initialization was
done when the first alert is triggered. This implied memory allocations and
tasks creations. With this patch, everything is now initialized during the
configuration parsing. So when an alert is triggered, only the memory required
by this alert is dynamically allocated.
Moreover, alerts processing had a flaw. The task handler used to process alerts
to be sent to the same mailer, process_email_alert, was designed to give back
the control to the scheduler when an alert was sent. So there was a delay
between the sending of 2 consecutives alerts (the min of
"proxy->timeout.connect" and "mailer->timeout.mail"). To fix this problem, now,
we try to process as much queued alerts as possible when the task is woken up.
An email alert contains a list of tcpcheck_rule. Each one is dynamically
allocated, just like its internal members. So, when an email alerts is freed, we
must be sure to properly free each tcpcheck_rule too.
This patch must be backported in 1.7 and 1.6.
This is a huge patch with many changes, all about the DNS. Initially, the idea
was to update the DNS part to ease the threads support integration. But quickly,
I started to refactor some parts. And after several iterations, it was
impossible for me to commit the different parts atomically. So, instead of
adding tens of patches, often reworking the same parts, it was easier to merge
all my changes in a uniq patch. Here are all changes made on the DNS.
First, the DNS initialization has been refactored. The DNS configuration parsing
remains untouched, in cfgparse.c. But all checks have been moved in a post-check
callback. In the function dns_finalize_config, for each resolvers, the
nameservers configuration is tested and the task used to manage DNS resolutions
is created. The links between the backend's servers and the resolvers are also
created at this step. Here no connection are kept alive. So there is no needs
anymore to reopen them after HAProxy fork. Connections used to send DNS queries
will be opened on demand.
Then, the way DNS requesters are linked to a DNS resolution has been
reworked. The resolution used by a requester is now referenced into the
dns_requester structure and the resolution pointers in server and dns_srvrq
structures have been removed. wait and curr list of requesters, for a DNS
resolution, have been replaced by a uniq list. And Finally, the way a requester
is removed from a DNS resolution has been simplified. Now everything is done in
dns_unlink_resolution.
srv_set_fqdn function has been simplified. Now, there is only 1 way to set the
server's FQDN, independently it is done by the CLI or when a SRV record is
resolved.
The static DNS resolutions pool has been replaced by a dynamoc pool. The part
has been modified by Baptiste Assmann.
The way the DNS resolutions are triggered by the task or by a health-check has
been totally refactored. Now, all timeouts are respected. Especially
hold.valid. The default frequency to wake up a resolvers is now configurable
using "timeout resolve" parameter.
Now, as documented, as long as invalid repsonses are received, we really wait
all name servers responses before retrying.
As far as possible, resources allocated during DNS configuration parsing are
releases when HAProxy is shutdown.
Beside all these changes, the code has been cleaned to ease code review and the
doc has been updated.
The cli command to show resolvers stats is in conflict with the command to show
proxies and servers stats. When you use the command "show stat resolvers [id]",
instead of printing stats about resolvers, you get the stats about all proxies
and servers.
Now, to avoid conflict, to print resolvers stats, you must use the following
command:
show resolvers [id]
This patch must be backported in 1.7.
The messages processing is done using existing functions. So here, the main task
is to find the SPOE engine to use. To do so, we loop on all filter instances
attached to the stream. For each, we check if it is a SPOE filter and, if yes,
if its name is the one used to declare the "send-spoe-group" action.
We also take care to return an error if the action processing is interrupted by
HAProxy (because of a timeout or an error at the HAProxy level). This is done by
checking if the flag ACT_FLAG_FINAL is set.
The function spoe_send_group is the action_ptr callback ot
Because we can have messages chained by event or by group, we need to have a way
to know which kind of list we manipulate during the encoding. So 2 types of list
has been added, SPOE_MSGS_BY_EVENT and SPOE_MSGS_BY_GROUP. And the right type is
passed when spoe_encode_messages is called.
This action is used to trigger sending of a group of SPOE messages. To do so,
the SPOE engine used to send messages must be defined, as well as the SPOE group
to send. Of course, the SPOE engine must refer to an existing SPOE filter. If
not engine name is provided on the SPOE filter line, the SPOE agent name must be
used. For example:
http-request send-spoe-group my-engine some-group
This action is available for "tcp-request content", "tcp-response content",
"http-request" and "http-response" rulesets. It cannot be used for tcp
connection/session rulesets because actions for these rulesets cannot yield.
For now, the action keyword is parsed and checked. But it does nothing. Its
processing will be added in another patch.
For now, this section is only parsed. It should have the following format:
spoe-group <grp-name>
messages <msg-name> ...
And then SPOE groups must be referenced in spoe-agent section:
spoe-agnt <name>
...
groups <grp-name> ...
The purpose of these groups is to trigger messages sending from TCP or HTTP
rules, directly from HAProxy configuration, and not on specific event. This part
will be added in another patch.
It is important to note that a message belongs at most to a group.
The engine name is now kept in "spoe_config" struture. Because a SPOE filter can
be declared without engine name, we use the SPOE agent name by default. Then,
its uniqness is checked against all others SPOE engines configured for the same
proxy.
* TODO: Add documentation
Now, it is possible to conditionnaly send a SPOE message by adding an ACL-based
condition on the "event" line, in a "spoe-message" section. Here is the example
coming for the SPOE documentation:
spoe-message get-ip-reputation
args ip=src
event on-client-session if ! { src -f /etc/haproxy/whitelist.lst }
To avoid mixin with proxy's ACLs, each SPOE message has its private ACL list. It
possible to declare named ACLs in "spoe-message" section, using the same syntax
than for proxies. So we can rewrite the previous example to use a named ACL:
spoe-message get-ip-reputation
args ip=src
acl ip-whitelisted src -f /etc/haproxy/whitelist.lst
event on-client-session if ! ip-whitelisted
ACL-based conditions are executed in the context of the stream that handle the
client and the server connections.
"check_http_req_capture" and "check_http_res_capture" functions have been added
to check validity of "http-request capture" and "http-response capture"
rules. Code for these functions come from cfgparse.c.
SPOE filter can be declared without engine name. This is an optional
parameter. But in this case, no scope must be used in the SPOE configuration
file. So engine name and scope are both undefined, and, obviously, we must not
try to compare them.
This patch must be backported in 1.7.
It was painful not to have the status code available, especially when
it was computed. Let's store it and ensure we don't claim content-length
anymore on 1xx, only 0 body bytes.
This patch reorganize the shctx API in a generic storage API, separating
the shared SSL session handling from its core.
The shctx API only handles the generic data part, it does not know what
kind of data you use with it.
A shared_context is a storage structure allocated in a shared memory,
allowing its usage in a multithread or a multiprocess context.
The structure use 2 linked list, one containing the available blocks,
and another for the hot locked blocks. At initialization the available
list is filled with <maxblocks> blocks of size <blocksize>. An <extra>
space is initialized outside the list in case you need some specific
storage.
+-----------------------+--------+--------+--------+--------+----
| struct shared_context | extra | block1 | block2 | block3 | ...
+-----------------------+--------+--------+--------+--------+----
<-------- maxblocks --------->
* blocksize
The API allows to store content on several linked blocks. For example,
if you allocated blocks of 16 bytes, and you want to store an object of
60 bytes, the object will be allocated in a row of 4 blocks.
The API was made for LRU usage, each time you get an object, it pushes
the object at the end of the list. When it needs more space, it discards
The functions name have been renamed in a more logical way, the part
regarding shctx have been prefixed by shctx_ and the functions for the
shared ssl session cache have been prefixed by sh_ssl_sess_.
Move the ssl callback functions of the ssl shared session cache to
ssl_sock.c. The shctx functions still needs to be separated of the ssl
tree and data.
It's important for the H2 to H1 gateway that the response parser properly
clears the H1 message's body_len when seeing these status codes so that we
don't hang waiting to transfer data that will not come.
A bind_conf does contain a ssl_bind_conf, which already has a flag to know
if early data are activated, so use that, instead of adding a new flag in
the ssl_options field.
Add a new sample fetch, "ssl_fc_has_early", a boolean that will be true
if early data were sent, and a new action, "wait-for-handshake", if used,
the request won't be forwarded until the SSL handshake is done.
When compiled with Openssl >= 1.1.1, before attempting to do the handshake,
try to read any early data. If any early data is present, then we'll create
the session, read the data, and handle the request before we're doing the
handshake.
For this, we add a new connection flag, CO_FL_EARLY_SSL_HS, which is not
part of the CO_FL_HANDSHAKE set, allowing to proceed with a session even
before an SSL handshake is completed.
As early data do have security implication, we let the origin server know
the request comes from early data by adding the "Early-Data" header, as
specified in this draft from the HTTP working group :
https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-replay
Use Openssl-1.1.1 SSL_CTX_set_client_hello_cb to mimic BoringSSL early callback.
Native multi certificate and SSL/TLS method per certificate is now supported by
Openssl >= 1.1.1.
switchctx early callback is only supported for BoringSSL. To prepare
the support of openssl 1.1.1 early callback, convert CBS api to neutral
code to work with any ssl libs.
This patch simply brings HAProxy internal regex system to the Lua API.
Lua doesn't embed regexes, now it inherits from the regexes compiled
with haproxy.
Allow to register a function which will be called after the
configuration file parsing, at the end of the check_config_validity().
It's useful fo checking dependencies between sections or for resolving
keywords, pointers or values.
This commit implements a post section callback. This callback will be
used at the end of a section parsing.
Every call to cfg_register_section must be modified to use the new
prototype:
int cfg_register_section(char *section_name,
int (*section_parser)(const char *, int, char **, int),
int (*post_section_parser)());
Calls to build_logline() are audited in order to use dynamic trash buffers
allocated by alloc_trash_chunk() instead of global trash buffers.
This is similar to commits 07a0fec ("BUG/MEDIUM: http: Prevent
replace-header from overwriting a buffer") and 0d94576 ("BUG/MEDIUM: http:
prevent redirect from overwriting a buffer").
This patch should be backported in 1.7, 1.6 and 1.5. It relies on commit
b686afd ("MINOR: chunks: implement a simple dynamic allocator for trash
buffers") for the trash allocator, which has to be backported as well.
When switching the check code to a non-permanent connection, the new code
forgot to free the connection if an error happened and was returned by
connect_conn_chk(), leading to the check never be ran again.
Now when ssl_sock_{to,from}_buf are called, if the connection doesn't
feature CO_FL_WILL_UPDATE, they will first retrieve the updated flags
using conn_refresh_polling_flags() before changing any flag, then call
conn_cond_update_sock_polling() before leaving, to commit such changes.
Now when raw_sock_{to,from}_{pipe,buf} are called, if the connection
doesn't feature CO_FL_WILL_UPDATE, they will first retrieve the updated
flags using conn_refresh_polling_flags() before changing any flag, then
call conn_cond_update_sock_polling() before leaving, to commit such
changes. Note that the only real call to one of the __conn_* functions
is in fact in conn_sock_read0() which is called from here.
In transport-layer functions (snd_buf/rcv_buf), it's very problematic
never to know if polling changes made to the connection will be propagated
or not. This has led to some conn_cond_update_polling() calls being placed
at a few places to cover both the cases where the function is called from
the upper layer and when it's called from the lower layer. With the arrival
of the MUX, this becomes even more complicated, as the upper layer will not
have to manipulate anything from the connection layer directly and will not
have to push such updates directly either. But the snd_buf functions will
need to see their updates committed when called from upper layers.
The solution here is to introduce a connection flag set by the connection
handler (and possibly any other similar place) indicating that the caller
is committed to applying such changes on return. This way, the called
functions will be able to apply such changes by themselves before leaving
when the flag is not set, and the upper layer will not have to care about
that anymore.
SSL records are 16kB max. When trying to send larger data chunks at once,
SSL_read() only processes 16kB and ssl_sock_from_buf() believes it means
the system buffers are full, which is not the case, contrary to raw_sock.
This is particularly noticeable with HTTP/2 when using a 64kB buffer with
multiple streams, as the mux buffer can start to fill up pretty quickly
in this situation, slowing down the data delivery.
We've been keep this test for a connection being established since 1.5-dev14
when the stream-interface was still accessing the FD directly. The test on
CO_FL_HANDSHAKE and L{4,6}_CONN is totally useless here, and can even be
counter-productive on pure TCP where it could prevent a request from being
sent on a connection still attempting to complete its establishment. And it
creates an abnormal dependency between the layers that will complicate the
implementation of the mux, so let's get rid of it now.
This is based on the git SHA1 implementation and optimized to do word
accesses rather than byte accesses, and to avoid unnecessary copies into
the context array.
in 32af203b75 ("REORG: cli: move ssl CLI functions to ssl_sock.c")
"set ssl tls-key" was accidentally replaced with "set ssl tls-keys"
(keys instead of key). This is undocumented and breaks upgrades from
1.6 to 1.7.
This patch restores "set ssl tls-key" and also registers a helptext.
This should be backported to 1.7.
Commit 872085ce "BUG/MINOR: ssl: ocsp response with 'revoked' status is correct"
introduce a regression. OCSP_single_get0_status can return -1 and haproxy must
generate an error in this case.
Thanks to Sander Hoentjen who have spotted the regression.
This patch should be backported in 1.7, 1.6 and 1.5 if the patch above is
backported.