Disable the cache if the append of data failed, it should never happen
because the allocated row size is at least equal to the size of the
object to allocate.
Forward the remaining headers with the data in the first call of
cache_store_http_forward_data().
Previously the headers were forwarded first, and the function left,
implying an additionnal call to cache_store_http_forward_data() for the
data.
Cc: Christopher Faulet <cfaulet@haproxy.com>
Use msg->sov to forward headers instead of msg->eoh. It can causes some
problem because eoh does not contains the last \r\n, and the filter does
not support to send the headers partially.
Cc: Christopher Faulet <cfaulet@haproxy.com>
If haproxy is started using the name of the binary only (i.e.
not using a relative or absolute path) the `execv` in
`mworker_reload` fails with `ENOENT`, because it does not
examine the `PATH`:
[WARNING] 315/161139 (7) : Reexecuting Master process
[WARNING] 315/161139 (7) : Cannot allocate memory
[WARNING] 315/161139 (7) : Failed to reexecute the master processs [7]
The error messages are misleading, because the return value of
`execv` is not checked. This should be fixed in a separate commit.
Once this happened the master process ignores any further
signals sent by the administrator.
Replace `execv` with `execvp` to establish the expected
behaviour.
This bug was introduced in commit 73b85e75b3.
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
In spoe_acquire_buffer and spoe_release_buffer, instead of checking the buffer
against buf_empty, we now check its size. It is important because when an
allocation fails, it will be set to buf_wanted. In both cases, the size is 0.
It is a proactive bug fix, no real problem was observed till now. It cannot be
backported as is in 1.7 because of all changes made on the SPOE in 1.8.
Marcus Rückert reported that commit d8b3b65 ("BUG/MEDIUM: splice/threads:
pipe reuse list was not protected.") broke threadless support. Add the
required #ifdef.
In the case of Transfer-Encoding: chunked, there is no Content-Length
which causes the cache to allocate a too small shctx row for the data.
It's not possible to allocate a shctx row for the chunks, we need to be
able to allocate on-the-fly the shctx blocks during the data transfer.
This method was reserved for the HTTP/2 connection preface, must never
be used and must be rejected. In normal situations it doesn't happen,
but it may be visible if a TCP frontend has alpn "h2" enabled, and
forwards to an HTTP backend which tries to parse the request. Before
this patch it would pass the wrong request to the backend server, now
it properly returns 400 bad req.
This patch should probably be backported to stable versions.
At a number of places, bitmasks are used for process affinity and to map
listeners to processes. Every time 1UL<<(relative_pid-1) is used. Let's
create a "pid_bit" variable corresponding to this value to clean this up.
It happens that no single analyser has ever needed to set res.analyse_exp,
so that process_stream() didn't consider it when computing the next task
expiration date. Since Lua actions were introduced in 1.6, this can be
needed on http-response actions for example, so let's ensure it's properly
handled.
Thanks to Nick Dimov for reporting this bug. The fix needs to be
backported to 1.7 and 1.6.
This is useful to know what thread(s) an fd is scheduled to be
handled on. It's worth noting that at the moment the "show fd"d
doesn't seem totally thread-safe.
In commit 53a4766 ("MEDIUM: connection: start to introduce a mux layer
between xprt and data") we introduced a release() function which ends
up never being used. Let's get rid of it now.
When a stream_interface performs a shutw() then a shutr(), the stream
is marked closed. Then cs_destroy() calls h2_detach() and it cannot
fail since we're on the leaving path of the caller. The problem is that
in order to close streams we usually have to send either an emty DATA
frame with the ES flag set or an RST_STREAM frame, and the mux buffer
might already be full, forcing the stream to be queued. The forced
removal of this stream causes this last message to silently disappear,
and the client to wait forever for a response.
This commit ensures we can detach the conn_stream from the h2 stream
if the stream is blocked, effectively making the h2 stream an orphan,
ensures that the mux can deal with orphaned streams after processing
them, and that the demux can kill them upon receipt of GOAWAY.
There is an issue with how the RST_STREAM frames are sent. Some of
them are sent from the demux, either for valid or for closed streams,
and some are sent from the mux always for valid streams. At the moment
the demux stream ID is used, which is wrong for all streams being muxed,
and sometimes results in certain bad HTTP responses causing the emission
of an RST_STREAM referencing stream zero. In addition, the stream's
blocked flags could be updated even if the stream was the closed or
idle ones.
We really need to split the function for the two distinct use cases where
one is used to send an RST on a condition detected at the connection level
(such as a closed stream) and the other one is used to send an RST for a
condition detected at the stream level. The first one is used only in the
demux, and the other one only by a valid stream.
To be thread safe, the function pattern_exec_match copy data (the pattern and
the inner sample) in thread-local variables. But when the sample is duplicated,
we must check its type and not the pattern one.
This is specific to threads, no backport is needed.
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.
Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.
To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.
This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
If the H1 parser would report a status code length not consisting in
exactly 3 digits, the error case was confused with a lack of buffer
room and was causing the parser to loop infinitely.
The H1 parser used by the H2 gateway was a bit lax and could validate
non-numbers in the status code. Since it computes the code on the fly
it's problematic, as "30:" is read as status code 310. Let's properly
check that it's a number now. No backport needed.
The build breaks on a machine without openssl/crypto.h because shctx
still loads openssl-compat.h while it doesn't need it anymore since
the code was moved :
In file included from src/shctx.c:20:0:
include/proto/openssl-compat.h:3:28: fatal error: openssl/crypto.h: No such file or directory
#include <openssl/crypto.h>
Just remove include openssl-compat from shctx.
Commit 522eea7 ("MINOR: ssl: Handle sending early data to server.") added
a dependency on SRV_SSL_O_EARLY_DATA which only exists when USE_OPENSSL
is defined (which is probably not the best solution) and breaks the build
when ssl is not enabled. Just add an ifdef USE_OPENSSL around the block
for now.
This adds a new keyword on the "server" line, "allow-0rtt", if set, we'll try
to send early data to the server, as long as the client sent early data, as
in case the server rejects the early data, we no longer have them, and can't
resend them, so the only option we have is to send back a 425, and we need
to be sure the client knows how to interpret it correctly.
With TLS 1.3, session aren't established until after the main handshake
has completed. So we can't just rely on calling SSL_get1_session(). Instead,
we now register a callback for the "new session" event. This should work for
previous versions of TLS as well.
My recent change in commit ce4e0aa ("MEDIUM: task: change the construction
of the loop in process_runnable_tasks()") was bogus as it used to keep the
rq_next across an unlock/lock sequence, occasionally leading to crashes for
tasks that are eligible to any thread. We must use the lookup call for each
new batch instead. The problem is easily triggered with such a configuration :
global
nbthread 4
listen check
mode http
bind 0.0.0.0:8080
redirect location /
option httpchk GET /
server s1 127.0.0.1:8080 check inter 1
server s2 127.0.0.1:8080 check inter 1
Thanks to Olivier for diagnosing this one. No backport is needed.
Commit 4ac4928 ("BUG/MINOR: stream-int: don't set MSG_MORE on SHUTW_NOW
without AUTO_CLOSE") was incomplete. H2 reveals another situation where
the input stream is marked closed with the request and we set MSG_MORE,
causing a delay before the request leaves.
Better avoid setting the flag on the request path for close cases in
general.
As part of the detection for intentional closes, we can kill the
connection if a shutw() happens before the headers. But it can also
happen that an invalid response is not properly parsed, preventing
any headers frame from being sent and making the function believe
it was an abort. Now instead we check if any response was received
from the stream, regardless of the fact that it was properly
converted.
It's pointless to requeue the task when we're closing, so swap the
order of the task_queue() and h2_release(). It also matches what
was written in the comment regarding re-arming the timer.
h2_detach() is called after a stream was closed, and it evaluates if it's
worth closing the connection. The issue there is that the connection is
closed too early in case there's demand for closing after the last stream,
even if some data remain in the mux. Let's change the condition to check
for this.
When the assignment of the connection state was moved into h2c_error(),
3 of them were missed because they were wrong, using H2_SS_ERROR instead.
This resulted in the connection's state being set to H2_CS_ERROR2 in fact,
so the error was not properly sent.
This one was created to maintain the knowledge that a stream was closed
after having sent an RST_STREAM frame but that's not needed anymore and
it confuses certain conditions on the error processing path. It's time
to get rid of it.
The call to xprt->snd_buf() was not conditionned on the presence of
data in the buffer, resulting in snd_buf() returning 0 and never
disabling the polling. It was revealed by the previous bug on error
processing but must properly be handled.
Some stream errors are detected on the MUX path (eg: H1 response
encoding). The ones forgot to emit an RST_STREAM frame, causing the
client to wait and/or to see the connection being immediately closed.
This is now fixed.
This flag was added after the GOAWAY flags were introduced and mistakenly
placed in the connection, but that doesn't make sense as it's specific to
the stream. The main impact is the risk of returning a DATA0+ES frame for
an error instead of an RST_STREAM.
snr_check_ip_callback() may be called with the server lock, so don't attempt
to lock it again, instead, make sure the callers always have the lock before
calling it.
The scheduler is complex and uses local queues to amortize the cost of
locks. But all this comes with a cost that is quite observable with
single-thread workloads.
The purpose of this patch is to reimplement the much simpler scheduler
for the case where threads are not used. The code is very small and
simple. It doesn't impact the multi-threaded performance at all, and
provides a nice 10% performance increase in single-thread by reaching
606kreq/s on the tests that showed 550kreq/s before.
process_runnable_tasks() needs to requeue or wake up tasks after
processing them in batches. By only refilling the existing ones, we
avoid revisiting all the queue. The performance gain is measurable
starting with two threads, where the request rate climbs to 657k/s
compared to 644k.
This patch slightly rearranges the loop to pack the locked code a little
bit, and to try to concentrate accesses to the tree together to benefit
more from the cache.
It also fixes how the loop handles the right margin : now that is guaranteed
that the retrieved nodes are filtered to only match the current thread, we
don't need to rewind every 16 entries. Instead we can rewind each time we
reach the right margin again.
With this change, we now achieve the following performance for 10 H2 conns
each containing 100 streams :
1 thread : 550kreq/s
2 thread : 644kreq/s
3 thread : 598kreq/s
This function is sensitive, let's make it shorter by factoring out the
unlock and leave code. This reduced the function's size by a few tens
of bytes and increased the overall performance by about 1%.
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.
This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.
This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.
Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.
Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :
Before After Gain
test duration : 89.6s 4.73s x19
HTTP requests/s (DEBUG) : 11200 211300 x19
HTTP requests/s (PROD) : 15900 447000 x28
spin_lock time : 85.2s 0.46s /185
time per lookup : 13us 40ns /325
Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.
A test showed that there's no benefit in using this for the wait queue
though.
The first pid in the pidfile is now the parent, it's more convenient for
supervising the processus.
You can now reload haproxy in master-worker mode with convenient command
like: kill -USR2 $(head -1 /tmp/haproxy.pid)
Commit 0493149 ("MINOR: thread: report multi-thread support in haproxy -vv")
added information about thread support in haproxy -vv output but accidently
marked the message as "must_free" while it's a constant. This causes a segv
on the old process on clean exit if threads are enabled. It doesn't affect
the stability during operations however.
unbind_listener() takes the listener lock, which is already held by
enable_listener(). This situation happens when starting with nbproc > 1
with some bind lines limited to a certain process, because in this case
enable_listener() tries to stop unneeded listeners.
This commit introduces __do_unbind_listeners() which must be called with
the lock held, and makes enable_listener() use this one. Given that the
only return code has never been used and that it starts to make the code
more complicated to propagate it before throwing it to the trash, the
function's return type was changed to void.
The spin_unlock() was called just before setting the expiry to
TICK_ETERNITY, so if another thread has the time to perform its
update and set a timeout, this would would clear it.
Commit c3680ec ("MINOR: add severity information to cli feedback messages")
introduced a severity level to CLI messages, but one of them was missed
on "set server addr". No backport is needed.
Given that all spinning loops we've had since 1.8-rc1 were caused by
unbalanced lock/unlock, let's get rid of all return statements in the
locked check functions and only exit via a a single unlock place.
The "set server <srv> check-port" CLI handler forgot to return after
detecting an error on the port number, and still proceeds with the action.
This needs to be backported to 1.7.
Some unlocks were missing, resulting in deadlocks even with a single thread.
We really need to make these functions safer by getting rid of all those
remaining "return" calls and only leave using a goto!
Using peers or stick table we could update an freq_ctr
using a tick value with the first bit set but this
bit is reserved for lock since multithreading support.
Fix bugs due to missing unlock and recursive lock performing
http health check.
The server's lock scope was enlarged to protect all callers
of 'set_server_check_status' and 'chk_report_conn_err'.
This fix also protects tcpcheck against concurrency.
Before introducing the mux layer, tcp_connect() would poll for sending
to detect the connection establishment. It happens that the health
checks have apparently never explicitly enabled this polling and have
been relying on this implicit one.
Now that there's the mux layer, the conn_stream needs to be enabled
for polling as well and since it's not done in the checks, it's never
done and the check's request doesn't leave the machine, as can be
noticed with http checks.
The solution simply consists in going back to the well-known case
where we enable polling after connecting using cs_want_send() if we
have anything but just a plain connect(). The regular data path is
not affected because the stream interface code automatically computes
the polling needs based on buffer contents.
This situation which must not happen does in fact happen when feeding
artificial responses using errorfiles, Lua or an applet. For now it
causes the H1 response parser to loop forever trying to get a more
complete response. Since it cannot progress, let's return an error.
Don't bother testing if len is nonzero, we know it is, as we're in the
"else" part of a if (!len), and testing it confuses clang into thinking
ret may be left uninitialized.
Store object in the cache. The cache use an shctx for storage.
It uses an http-response action to store the headers and a filter to
store the body. The http-response action is used in order to allow
modifications by other actions before caching.
Forbid shctx to read more than expected, it allows you to use a greater
value as a len with shctx_row_data_get(), the size of the destination
buffer for example.
Previous commit ea3928 (MEDIUM: h2: apply a timeout to h2 connections)
was wrong for two reasons. The first one is that if the client timeout
is not set, it's used as zero, preventing connections from establishing.
The second reason is that if the timeout triggers with active streams
(normally it should not since the task is supposed to be disabled), the
task is removed (h2c->task=NULL), and the last quitting stream might
try to dereference it.
Instead of doing this, we simply not register the task if there's no
timeout (it's useless) and we always control its presence in the streams.
Till now there was no way to deal with a dead H2 connection. Now each
connection creates a task that wakes up to kill the connection. Its
timeout is constantly refreshed when there's some activity. In case
the timeout triggers, the best effort attempts are made at sending a
clean GOAWAY message before closing and signaling the streams.
The timeout is automatically disabled when there's an active stream on
the connection, and restarted when the last stream finishes. This way
it should not affect long sessions.
Given that we're processing data produced by haproxy, we know that the
situations where haproxy doesn't return anything are :
- request timeout with option http-ignore-probes : there's no reason to
hit this since we're creating the stream with the request into it ;
- tcp-request content reject : this definitely means we want to kill the
connection and abort keep-alive and any further processing ;
- using /dev/null as the error file to hide an error
In practice it appears that using the abort on empty response as a hint to
trigger a connection close is very appropriate to continue to give the
control over the connection management. This patch thus tries to send a
GOAWAY frame with the max_id presented as the last stream ID, then sends
an RST_STREAM for the current stream. For the client, this means that the
connection must be shut down immediately after processing the last pending
streams and that the current stream is aborted. This way it's still possible
to force connections to be closed using tcp-request rules.
After some long brainstorming sessions, it appears that "Connection: close"
seems to be the best signal from the L7 layer to indicate the need to close
the connection. Indeed, in H1 it is only present in very rare cases (eg:
certain unrecoverable errors, some of which could remove it now by the way).
It will also be added when the L7 layer wants to force the connection to
terminate. By default when running in keep-alive mode it is not present.
It's worth mentionning that in H1 with persistent connections, we have sort
of a concurrency-1 mux and this header field is used the same way.
Thus here this patch detects "Connection: close" in response headers and
if seen, sends a GOAWAY frame with the highest possible ID so that the
client knows that it can quit whenever it wants to. If more aggressive
closures are needed in the future, we may decide to advertise the max_id
to abort after the current requests and better honor "http-request deny".
For a graceful shutdown, the specs requries to discard frames with a
stream ID higher than the advertised last_id. (RFC7540#6.8). Well,
finally for now the code is disabled (see last page of #6.8). Some
frames need to be processed anyway to maintain the compression state
and the flow control window state, but we don't have any trivial way
to do this and ignore them at the same time. For the headers it's
the worst case where we can't parse headers frames without coming
from the streams, and we don't want to create such streams as we'd
have to abort them, and aborting would cause errors to flow back.
Possibly that a longterm solution might involve using some dummy
streams and dummy buffers for this and calling the parsers directly.
RFC7540#5.1 is pretty clear : "any frame other than WINDOW_UPDATE,
PRIORITY, or RST_STREAM in this state MUST be treated as a connection
error of type STREAM_CLOSED". Instead of dealing with this for each
and every frame type, let's do it once for all in the main demux loop.
RFC7540#5.1 is pretty clear : "any frame other than HEADERS or PRIORITY
in this state MUST be treated as a connection error". Instead of dealing
with this for each and every frame type, let's do it once for all in the
main demux loop.
The ID is respected, and only IDs greater than the advertised last_id
are woken up, with a CS_FL_ERROR flag to signal that the stream is
aborted. This is necessary for a browser to abort a download or to
reject a bad response that affects the connection's state.
Let's replace h2_wake_all_streams() with h2_wake_some_streams(), to
support signaling only streams by their ID (for GOAWAY frames) and
to pass the flags to add on the conn_stream.
When a stream sends a shutw, we send an empty DATA frame with the ES
flag set, except if no HEADERS were sent, in which case we rather send
RST_STREAM. On shutr(1) to abort a request, an RST_STREAM frame is sent
if the stream is OPEN and the stream is closed. Care is taken to switch
the stream's state accordingly and to avoid sending an ES bit again or
another RST once already done.
Data frames are received and transmitted. The per-connection and
per-stream amount of data to ACK is automatically updated. Each
DATA frame is ACKed because usually the downstream link is large
and the upstream one is small, so it seems better to waste a few
bytes every few kilobytes to maintain a low ACK latency and help
the sender keep the link busy. The connection's ACK however is
sent at the end of the demux loop and at the beginning of the mux
loop so that a single aggregated one is emitted (connection
windows tend to be much larger than stream windows).
A future improvement would consist in sending a single ACK for
multiple subsequent DATA frames of the same stream (possibly
interleaved with window updates frames), but this is much trickier
as it also requires to remember the ID of the stream for which
DATA frames have to be sent.
Ideally in the near future we should chunk-encode the body sent
to HTTP/1 when there's no content length and when the request is
not a CONNECT. It's just uncertain whether it's the best option
or not for now.
When it is detected that the number of received bytes is > 0 on the
connection at the end of the demux call or before starting to process
pending output data, an attempt is made at sending a WINDOW UPDATE on
the connection. In case of failure, it's attempted later.
For now we don't build a HEADERS frame with them, but at least we remove
them from the response so that the L7 chunk parser inside isn't blocked
on these (often two) remaining bytes that don't want to leave the buffer.
It also ensures that trailers delivered progressively will correctly be
skipped.
The H1 response data are processed (either following content-length or
chunks) and emitted as H2 DATA frames. In the case of content-length,
the maximum size permitted by the mux buffer, the max frame size, the
connection's window and the stream's window it used to determine the
frame size. For chunked encoding, the same limitation applies, but in
addition, each chunk leads to a distinct frame. This could be improved
in the future to aggregate chunks into larger frames.
Streams blocked on the connection's flow control subscribe to the
connection's fctl_list to be woken up when the window opens again.
Streams blocked on their own flow control don't subscribe to anything,
they just sit waiting for window update frames to reopen the window.
The connection-close mode (without content-length) partially works thanks
to the fact that the SHUTW event leads to a close of the stream. In
practice an empty DATA frame should be sent in this case though.
This calls the h1 response parser and feeds the output through the hpack
encoder to produce stateless HPACK bytecode into an output chunk. For now
it's a bit naive but reasonably efficient.
The HPACK encoder relies on hpack_encode_header() so that the most common
response header fields are encoded based on the static header table. The
forbidden header field names (connection, proxy-connection, upgrade,
transfer-encoding, keep-alive) are dropped before calling the hpack
encoder.
A new flag (H2_CF_HEADERS_SENT) is set once such a frame is emitted. It
will be used to know if we can send an empty DATA+ES frame to use as a
shutdown() signal or if we have to use RST_STREAM.
The trash is already used by the hpack layer and for Huffman decoding,
it's unsafe to use here as a buffer and results in corrupted data. Use
a safely allocated trash instead.
This takes care of creating a new h2s and a new conn_stream when a
HEADERS frame arrives. The recv() callback from the data layer is then
called to extract the frame into the stream's buffer. It is verified
that the stream ID is strictly greater than the known max stream ID.
And the last_id is updated if the current request is properly converted.
The streams are created in open or half-closed(remote) states.
For now there are some limitations :
- frames without END_HEADERS are rejected (CONTINUATION not supported
yet, will require some more changes so that the stream processor
checks the H2 frame header by itself and steals the frames from the
connection)
- padding/stream_dep/priority are currently ignored
- limited error handling, could be improved
But at least the request is properly decoded, transcoded and processed.
If a stream is killed for whatever reason and it happens to be the one
currently blocking the connection, we must unblock the connection and
enable polling again so that it can attempt to make progress. This may
happen for example on upload timeout, where the demux is blocked due to
a full stream buffer, and the stream dies on server timeout and quits.
This does the very minimum required to release a stream and/or a connection
upon the stream's request. The only thing is that it doesn't kill the
connection unless it's already closed or in error or the stream ID reached
the one specified in GOAWAY frame. We're supposed to arm a timer to close
after some idle timeout but it's not done.
For now we have nowhere to store partial header frames so we can't
handle CONTINUATION frames and we must reject them. In this case we
respond with a stream error of type INTERNAL_ERROR.
This one sends an RST_STREAM for a given stream, using the current
demux stream ID. It's also used to send RST_STREAM for streams which
have lost their CS part (ie were aborted).
Now they really increase the window size of connections and streams.
If a stream was not queued but requested to send, it means it was
flow-controlled so it's added again into the connection's send list.
The INITIAL_WINDOW_SIZE and MAX_FRAME_SIZE settings are now extracted
from the settings frame, assigned to the connection, and attempted to
be propagated to all existing streams as per the specification. In
practice clients rarely update the settings after sending the first
stream, so the propagation will rarely be used. The ACK is properly
sent after the frame is completely parsed.
The function h2_process_demux() now tries to parse the incoming bytes
to process as many streams as possible. For now it does nothing but
dropping all incoming frames.
Instead of doing a special processing of the first SETTINGS frame, we
simply parse its header, check that it matches the expected frame type
and flags (ie no ACK), and switch to FRAME_P to parse it as any regular
frame. The regular frame parser will take care of decoding it.
An initial settings frame is emitted upon receipt of the connection
preface, which takes care of configured values. These settings are
only emitted when they differ from the protocol's default value :
- header_table_size (defaults to 4096)
- initial_window_size (defaults to 65535)
- max_concurrent_streams (defaults to unlimited)
- max_frame_size (defaults to 16384)
The max frame size is a copy of tune.bufsize. Clients will most often
reject values lower than 16384 and currently there's no trivial way to
check if H2 is going to be used at boot time.
The send() callback calls h2_process_mux() which iterates over the list
of flow controlled streams first, then streams waiting for room in the
send_list. If a stream from the send_list ends up being flow controlled,
it is then moved to the fctl_list. This way we can maintain the most
accurate fairness by ensuring that flows are always processed in order
of arrival except when they're blocked by flow control, in which case
only the other ones may pass in front of them.
It's a bit tricky as we want to remove a stream from the active lists
if it doesn't block (ie it has no reason for staying there).
If the polling update function is called with RD_ENA while H2_CF_DEM_SFULL
indicates the demux had to block on a stream buffer full condition, we can
remove the flag and re-enable polling for receiving because this is the
indication that a consumer stream has made some room in the buffer. Probably
that we should improve this to ensure that h2s->id == h2c->dsi and avoid
trying to receive multiple times in a row for the wrong stream.
A conn_stream indicates its intent to send by setting the WR_ENA flag
and calling mux->update_poll(). There's no synchronous write so the only
way to emit a response from a stream is to proceed this way. The sender
h2s is then queued into the h2c's send_list if it was not yet queued.
Once the connection is ready, it will enter its send() callback to visit
writers, calling their data->send_cb() callback to complete the operation
using mux->snd_buf().
Also we enable polling if the mux contains data and wasn't enabled. This
may happen just after a response has been transmitted using chk_snd().
It likely is incomplete for now and should probably be refined.
The H2 preface is properly detected to switch to the settings state.
It's important to note that for now we don't send out settings frame
so the operation is not complete yet.
For now it's only used to report immediate errors by announcing the
highest known stream-id on the mux's error path. The function may be
used both while processing a stream or directly in relation with the
connection. The wake() callback will automatically ask for send access
if an error is reported. The function should be usable for graceful
shutdowns as well by simply setting h2c->last_sid to the highest
acceptable stream-id (2^31-1) prior to calling the function.
A connection flag (H2_CF_GOAWAY_SENT) is set once the frame was
successfully sent. It will be usable to detect when it's safe to
close the connection.
Another flag (H2_CF_GOAWAY_FAILED) is set in case of unrecoverable
error while trying to send. It will also be used to know when it's safe
to close the connection.
The rcv_buf() callback now calls h2_process_demux() after an recv() call
leaving some data in the buffer, and the snd_buf() callback calls
h2_process_mux() to try to process pending data from streams.
If some streams were blocked on flow control and the connection's
window was recently opened, or if some streams are waiting while
no block flag remains, we immediately want to try to send again.
This can happen if a recv() for a stream wants to send after the
send() loop has already been processed.
During h2_wake(), there are various situations that can lead to the
connection being closed :
- low-level connection error
- read0 received
- fatal error (ERROR2)
- failed to emit a GOAWAY
- empty stream list with max_id >= last_sid
In such cases, all streams are notified and we have to wait for all
streams to leave while doing nothing, or if the last stream is gone,
we can simply terminate the connection.
It's important to do this test there again because an error might arise
while trying to send a pending GOAWAY after the last stream for example,
thus there's possibly no way to get notified of a closing stream.
It happens that an H2 mux is totally unusable once the client has shut,
so we must consider this situation equivalent to the connection error,
and let the possible streams drain their data if needed then stop.
Now we start to set the flags to indicate that the response buffer is
being awaited or that it is full, it makes it possible to centralize a
little bit the polling management into the wake() callback.
In case of error, we wake all the streams up so that they are aware of
the nature of the event and are able to detach if needed.
Flag H2_CF_DEM_DALLOC is set when the demux buffer fails to be allocated
in the recv() callback, and is cleared when it succeeds.
Both flags H2_CF_MUX_MALLOC and H2_CF_DEM_MROOM are cleared when the mux
buffer allocation succeeds.
In both cases it will be up to the callers to report allocation failures.
This one will be used by the HEADERS frame handler and maybe later by
the PUSH frame handler. It creates a conn_stream in the mux's connection.
The create streams are inserted in the h2c's tree sorted by IDs. The
caller is expected to have verified that the stream doesn't exist yet.
It will be more convenient to always manipulate existing streams than
null pointers. Here we create one idle stream and one closed stream.
The idea is that we can easily point any stream to one of these states
in order to merge maintenance operations.
Functions h2_get_buf_n{16,32,64}() and h2_get_buf_bytes() respectively
extract a network-ordered 16/32/64 bit value from a possibly wrapping
buffer, or any arbitrary size. They're convenient to retrieve a PING
payload or to parse SETTINGS frames. Since they copy one byte at a time,
they will be less efficient than a memcpy-based implementation on large
blocks.
This function extracts the next frame header but doesn't consume it.
This will allow to detect a stream-id change and to perform a yielding
window update without losing information. The result is stored into a
temporary frame descriptor. We could also store the next frame header
into the connection but parsing the header again is much cheaper than
wasting bytes in the connection for a rare use case.
A function (h2_skip_frame_hdr()) is also provided to skip the parsed
header (always 9 bytes) and another one (h2_get_frame_hdr()) to do both
at once.
This function is called after preparing a frame, in order to update the
frame's size in the frame header. It takes the frame payload length in
argument.
It simply writes a 24-bit frame size into a buffer, making use of the
net_helper functions which try to optimize per platform (this is a
frequently used operation).
This one will store the error into the stream's errcode if it's neither
idle nor closed (since these ones are read-only) and switch its state to
H2_SS_ERROR. If a conn_stream is attached, it will be flagged with
CS_FL_ERROR.
A mux is busy when any stream id >= 0 is currently being handled
and the current stream's id doesn't match. When no stream is
involved (ie: demuxer), stream 0 is considered. This will be
necessary to know when it's possible to send frames.
A demux may be prevented from receiving for the following reasons :
- no receive buffer could be allocated
- the receive buffer is full
- a response is needed and the mux is currently being used by a stream
- a response is needed and some room could not be found in the mux
buffer (either full or waiting for allocation)
- the stream buffer is waiting for allocation
- the stream buffer is full
A mux may stop accepting data for the following reasons :
- the buffer could not be allocated
- the buffer is full
A stream may stop sending data to a mux for the following reaons :
- the mux is busy processing another stream
- the mux buffer lacks room (full or not allocated)
- the mux's flow control prevents from sending
- the stream's flow control prevents from sending
All these conditions were turned into flags for use by the respective
places.
The idea is that we may need a mux buffer for anything, ranging from
receiving to sending traffic. For now it's unclear where exactly the
calls will be placed so let's block both send and recv when a buffer
is missing, and re-enable both of them at the end. This will have to
be changed later.
This patch implements a very basic Rx buffer management. The mux needs
an rx buffer to decode the connection's stream. If this buffer it
available upon Rx events, we fill it with whatever input data are
available. Otherwise we try to allocate it and subscribe to the buffer
wait queue in case of failure. In such a situation, a function
"h2_dbuf_available()" will be called once a buffer may be allocated.
The buffer is released if it's still empty after recv().
The connection's h2c context is now allocated and initialized on mux
initialization, and released on mux destruction. Note that for now the
release() code is never called.
We need to deal with stream error notifications (RST_STREAM) as well as
internal reporting. The problem is that we don't know in which order
this will be done so we can't unilaterally decide to deallocate the
stream. In order to help, we add two extra stream states, H2_SS_ERROR
and H2_SS_RESET. The former mentions that the stream has an error pending
and the latter indicates that the error was already sent and that the
stream is now closed. It's equivalent to H2_SS_CLOSED except that in this
state we'll avoid sending new RST_STREAM as per RFC7540#5.4.2.
With this it will be possible to only detach or deallocate the h2s once
the stream is closed.
This describes an HTTP/2 stream with its relation to the connection
and to the conn_stream on the other side.
For now we also allocate request and response state for HTTP/1 because
the internal HTTP representation is HTTP/1 at the moment. Later this
should evolve towards a version-agnostic representation and this H1
message state will disappear.
It's important to consider that the streams are necessarily polarized
depending on h2c : if the connection is incoming, streams initiated by
the connection receive requests and send responses. Otherwise it's the
other way around. Such information is known during the connection
instanciation by h2c_frt_init() and will normally be reflected in the
stream ID (odd=demux from client, even=demux from server). The initial
H2_CS_PREFACE state will also depend on the direction. The current h2c
state machine doesn't allow for outgoing connections as it uses a single
state for both (rx state only). It should be the demux state only.
The h2c struct describes an H2 connection context and is assigned as the
mux's context. It has its own pool, allocated at boot time and released
after deinit().
For now it only supports literals and a bit of static header table
references for the 9 most common header field names (date, server,
content-type, content-length, last-modified, accept-ranges, etag,
cache-control, location).
A previous incarnation of this commit used to strip the forbidden H2
header names (connection, proxy-connection, upgrade, transfer-encoding,
keep-alive) but this is no longer the case as this filtering is irrelevant
to HPACK encoding and is specific to H2, so this will have to be done by
the caller.
It's quite not optimal but works fine enough to prepare some valid and
partially compressed responses during development.
The decoder is now fully functional. It makes use of the dynamic header
table. Dynamic header table size updates are currently ignored, as our
initially advertised value is the highest we support. Strictly speaking,
the impact is that a client referencing a header field after such an
update wouldn't observe an error instead of the connection being dropped
if it was implemented.
Decoded header fields are copied into a target buffer in HTTP/1 format
using HTTP/1.1 as the version. The Host header field is automatically
appended if a ":authority" header field is present.
All decoded header fields can be displayed if the file is compiled with
DEBUG_HPACK.
This code deals with header insertion, retrieval and eviction, as well
as with dynamic header table defragmentation. It is functional for use
as a decoder and was heavily tested in this context. There's still some
room for optimization (eg: the defragmentation code currently does it
in place using a memcpy).
Also for now the dynamic header table is allocated using malloc() while
a pool needs to be created instead.
This code was mostly imported from https://github.com/wtarreau/http2-exp
with "hpack_" prepended in front of most names to avoid risks of conflicts.
Some small cleanups and renamings were applied during the import. This
version must be considered more recent.
Some HPACK error codes were placed here (HPACK_ERR_*), not exactly because
they're needed by the decoder but they'll be needed by all callers. Maybe
a different location should be found.
The code was borrowed from the HPACK experimental implementations
available here :
https://github.com/wtarreau/http2-exp
It contains the Huffman table as specified in RFC7541 Appendix B, and a
set of reverse tables used to decode a Huffman byte stream, and produced
by contrib/h2/gen-rht. The encoder is not finalized, it doesn't emit the
byte stream but this is not needed for now.
Now we don't remove the session when a stream dies, instead we
detach the stream and let the mux decide to release the connection
and call session_free() instead.
Since multiple streams can share one session attached to one listener,
the listener_release() call must be done in session_free() and not in
stream_free(), otherwise we end up with a negative count in H2.
This callback will be used to release upper layers when a mux is in
use. Given that the mux can be asynchronously deleted, we need a way
to release the extra information such as the session.
This callback will be called directly by the mux upon releasing
everything and before the connection itself is released, so that
the callee can find its information inside the connection if needed.
The way it currently works is not perfect, and most likely this should
instead become a mux release callback, but for now we have no easy way
to add mux-specific stuff, and since there's one mux per connection,
it works fine this way.
Now that the mux will take care of closing the client connection at the
right moment, we don't need to close the client connection anymore, and
we just need to close the conn_stream.
For H2, only the mux's timeout or other conditions might cause a
release of the mux and the connection, no stream should be allowed
to kill such a shared connection. So a stream will only detach using
cs_destroy() which will call mux->detach() then free the cs.
For now it's only handled by mux_pt. The goal is that the data layer
never has to care about the connection, which will have to be released
depending on the mux's mood.
At all call places where a conn_stream is in use, we can now use
cs_close() to get rid of a conn_stream and of its underlying connection
if the mux estimates it makes sense. This is what is currently being
done for the pass-through mux.
Now these functions are able to automatically close both the transport
and the socket layer, causing the whole connection to be torn down if
needed.
The two shutdown modes are implemented for both directions, and when
a direction is closed, if it sees the other one is closed as well, it
completes by closing the connection. This is similar to what is performed
in the stream interface.
It's not deployed yet but the purpose is to get rid of conn_full_close()
where only conn_stream should be known.
Instead of having to manually handle lingering outside, let's make
conn_sock_shutw() check for it before calling shutdown(). We simply
don't want to emit the FIN if we're going to reset the connection
due to lingering. It's particularly important for silent-drop where
it's absolutely mandatory that no packet leaves the machine.
In a 1:1 connection:stream there's no problem relying on the connection
flags alone to check for errors. But in a mux, it will be possible to mark
certain streams in error without having to mark all of them. An example is
an H2 client sending RST_STREAM frames to abort a long download, or a parse
error requiring to abort only this specific stream.
This commit ensures that stream-interface and checks properly check for
CS_FL_ERROR in cs->flags wherever CO_FL_ERROR was in use. Most likely over
the long term, any check for CO_FL_ERROR will have to disappear.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
This patch introduces a new struct conn_stream. It's the stream-side of
a multiplexed connection. A pool is created and destroyed on exit. For
now the conn_streams are not used at all.
A new sample fetch function reports either 1 or 2 for the on-wire encoding,
to indicate if the request was received using the HTTP/1.x format or HTTP/2
format. Note that it reports the on-wire encoding, not the version presented
in the request header.
This will possibly have to evolve if it becomes necessary to report the
encoding on the server side as well.
When an incoming connection is made on an HTTP mode frontend, the
session now looks up the mux to use based on the ALPN token and the
proxy mode. This will allow easier mux registration, and we don't
need to hard-code the mux_pt_ops anymore.
The pass-through mux is the fallback used on any incoming connection
unless another mux claims the ALPN name and the proxy mode. Thus mux_pt
registers ALPN token "" (empty name) which catches everything.
Selecting a mux based on ALPN and the proxy mode will quickly become a
pain. This commit provides new functions to register/lookup a mux based
on the ALPN string and the proxy mode to make this easier. Given that
we're not supposed to support a wide range of muxes, the lookup should
not have any measurable performance impact.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
commit 6e01286 (BUG/MAJOR: threads/freq_ctr: fix lock on freq counters)
attempted to fix the loop using volatile but that doesn't work depending
on the level of optimization, resulting in situations where the threads
could remain looping forever. Here we use memory barriers between reads
to enforce a strict ordering and the asm code produced does exactly what
the C code does and works perfectly, with a 3-digit measurement accuracy
observed during a test.
This is needed in the H2->H1 gateway so that we know how long the trailers
block is in chunked encoding. It returns the number of bytes, or 0 if some
are missing, or -1 in case of parse error.
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
srv_set_fqdn() may be called with the DNS lock already held, but tries to
lock it anyway. So, add a new parameter to let it know if it was already
locked or not;
In function tv_update_date, we keep an offset reprenting the time deviation to
adjust the system time. At every call, we check if this offset must be updated
or not. Of course, It must be shared by all threads. It was store in a
timeval. But it cannot be atomically updated. So now, instead, we store it in a
64-bits integer. And in tv_update_date, we convert this integer in a
timeval. Once updated, it is converted back in an integer to be atomically
stored.
To store a tv_offset into an integer, we use 32 bits from tv_sec and 32 bits
tv_usec to avoid shift operations.
The wrong bit was set to keep the lock on freq counter update. And the read
functions were re-worked to use volatile.
Moreover, when a freq counter is updated, it is now rotated only if the current
counter is in the past (now.tv_sec > ctr->curr_sec). It is important with
threads because the current time (now) is thread-local. So, rounded to the
second, the time may vary by more or less 1 second. So a freq counter rotated by
one thread may be see 1 second in the future. In this case, it is updated but
not rotated.
There was a flaw in the way the threads was created. the main one was just used
to create all the others and just wait to exit. Now, it is used to run a poll
loop. So we only create nbthread-1 threads.
This also fixes a bug about the compression filter when there is only 1 thread
(nbthread == 1 or no threads support). The bug was in the way thread-local
resources was initialized. per-thread init/deinit callbacks were never called
for the main process. So, with nthread set to 1, some buffers remained
uninitialized.
For now, we don't know if device detection modules (51degrees, deviceatlas and
wurfl) are thread-safe or not. So HAproxy exits with an error when you try to
use one of them with nbthread greater than 1.
We will ask to maintainers of these modules to make them thread-safe or to give
us hints to do so.
Tasks used to process checks are created to be processed by any threads. But,
once a check is started, we must be sure to be sticky on the running thread
because I/O will be also sticky on it. This is a requirement for now: Tasks and
I/O handlers linked to the same session must be executed on the same thread.
By default, no affinity is set for threads. To bind threads on CPU, you must
define a "thread-map" in the global section. The format is the same than the
"cpu-map" parameter, with a small difference. The process number must be
defined, with the same format than cpu-map ("all", "even", "odd" or a number
between 1 and 31/63).
A thread will be bound on the intersection of its mapping and the one of the
process on which it is attached. If the intersection is null, no specific bind
will be set for the thread.
Because there is not migration mechanism yet, all runtime information about an
SPOE agent are thread-local and async exchanges with agents are disabled when we
have serveral threads. Howerver, pipelining is still available. So for now, the
thread part of the SPOE is pretty simple.
We have two y for nsuring that the data is not concurently manipulated:
- locks
- running task on the same thread.
locks are expensives, it is better to avoid it.
This patch cecks that the Lua task run on the same thread that
the stream associated to the coprocess.
TODO: in a next version, the error should be replaced by a yield
and thread migration request.
The applet manipulates the session and its buffers. We have two methods for
ensuring that the memory of the session will not change during its manipulation
by the task:
1 - adding mutex
2 - running on the same threads than the task.
The second point is smart because it cannot lock the execution of another thread.
Note that the Lua processing is not really thread safe. It provides
heavy system which consists to add our own lock function in the Lua
code and recompile the library. This system will probably not accepted
by maintainers of various distribs.
Our main excution point of the Lua is the function lua_resume(). A
quick looking on the Lua sources displays a lua_lock() a the start
of function and a lua_unlock() at the end of the function. So I
conclude that the Lua thread safe mode just perform a mutex around
all execution. So I prefer to do this in the HAProxy code, it will be
easier for distro maintainers.
Note that the HAProxy lua functions rounded by the macro SET_SAFE_LJMP
and RESET_SAFE_LJMP manipulates the Lua stack, so it will be careful
to set mutex around these functions.
The jmpbuf contains pointer on the stack memory address currently use
when the jmpbuf is set. So the information is local to each thread.
The struct field is too big to put it in the stack, but it is used
as buffer for retriving stats values. So, this buffer si local to each
threads. Each function using this buffer, use it whithout break (yield)
so, the consistency of local buffer is ensured.
Now, it is possible to define init_per_thread and deinit_per_thread callbacks to
deal with ressources allocation for each thread.
This is the filter responsibility to deal with concurrency. This is also the
filter responsibility to know if HAProxy is started with some threads. A good
way to do so is to check "global.nbthread" value. If it is greater than 1, then
_per_thread callbacks will be called.
A RW lock has been added to the vars structure to protect each list of
variables. And a global RW lock is used to protect registered names.
When a varibable is fetched, we duplicate sample data because the variable could
be modified by another thread.
When a frequency counter must be updated, we use the curr_sec/curr_tick fields
as a lock, by setting the MSB to 1 in a compare-and-swap to lock and by reseting
it to unlock. And when we need to read it, we loop until the counter is
unlocked. This way, the frequency counters are thread-safe without any external
lock. It is important to avoid increasing the size of many structures (global,
proxy, server, stick_table).
locks have been added in pat_ref and pattern_expr structures to protect all
accesses to an instance of on of them. Moreover, a global lock has been added to
protect the LRU cache used for pattern matching.
Patterns are now duplicated after a successfull matching, to avoid modification
by other threads when the result is used.
Finally, the function reloading a pattern list has been modified to be
thread-safe.
First, OpenSSL is now initialized to be thread-safe. This is done by setting 2
callbacks. The first one is ssl_locking_function. It handles the locks and
unlocks. The second one is ssl_id_function. It returns the current thread
id. During the init step, we create as much as R/W locks as needed, ie the
number returned by CRYPTO_num_locks function.
Next, The reusable SSL session in the server context is now thread-local.
Shctx is now also initialized if HAProxy is started with several threads.
And finally, a global lock has been added to protect the LRU cache used to store
generated certificates. The function ssl_sock_get_generated_cert is now
deprecated because the retrieved certificate can be removed by another threads
in same time. Instead, a new function has been added,
ssl_sock_assign_generated_cert. It must be used to search a certificate in the
cache and set it immediatly if found.
A lock is used to protect accesses to a peer structure.
A the lock is taken in the applet handler when the peer is identified
and released living the applet handler.
In the scheduling task for peers section, the lock is taken for every
listed peer and released at the end of the process task function.
The peer 'force shutdown' function was also re-worked.
A global lock has been added to protect accesses to the list of active
applets. A process mask has also been added on each applet. Like for FDs and
tasks, it is used to know which threads are allowed to process an
applet. Because applets are, most of time, linked to a session, it should be
sticky on the same thread. But in all cases, it is the responsibility of the
applet handler to lock what have to be protected in the applet context.
This is done by passing the right stream's proxy (the frontend or the backend,
depending on the context) to lock the error snapshot used to store the error
info.
The stick table API was slightly reworked:
A global spin lock on stick table was added to perform lookup and
insert in a thread safe way. The handling of refcount on entries
is now handled directly by stick tables functions under protection
of this lock and was removed from the code of callers.
The "stktable_store" function is no more externalized and users should
now use "stktable_set_entry" in any case of insertion. This last one performs
a lookup followed by a store if not found. So the code using "stktable_store"
was re-worked.
Lookup, and set_entry functions automatically increase the refcount
of the returned/stored entry.
The function "sticktable_touch" was renamed "sticktable_touch_local"
and is now able to decrease the refcount if last arg is set to true. It
is allowing to release the entry without taking the lock twice.
A new function "sticktable_touch_remote" is now used to insert
entries coming from remote peers at the right place in the update tree.
The code of peer update was re-worked to use this new function.
This function is also able to decrease the refcount if wanted.
The function "stksess_kill" also handle a parameter to decrease
the refcount on the entry.
A read/write lock is added on each entry to protect the data content
updates of the entry.
A lock for LB parameters has been added inside the proxy structure and atomic
operations have been used to update server variables releated to lb.
The only significant change is about lb_map. Because the servers status are
updated in the sync-point, we can call recalc_server_map function synchronously
in map_set_server_status_up/down function.
This list is used to save changes on the servers state. So when serveral threads
are used, it must be locked. The changes are then applied in the sync-point. To
do so, servers_update_status has be moved in the sync-point. So this is useless
to lock it at this step because the sync-point is a protected area by iteself.
For now, we have a list of each type per thread. So there is no need to lock
them. This is the easiest solution for now, but not the best one because there
is no sharing between threads. An idle connection on a thread will not be able
be used by a stream on another thread. So it could be a good idea to rework this
patch later.
Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
First, we use atomic operations to update jobs/totalconn/actconn variables,
listener's nbconn variable and listener's counters. Then we add a lock on
listeners to protect access to their information. And finally, listener queues
(global and per proxy) are also protected by a lock. Here, because access to
these queues are unusal, we use the same lock for all queues instead of a global
one for the global queue and a lock per proxy for others.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
Many changes have been made to do so. First, the fd_updt array, where all
pending FDs for polling are stored, is now a thread-local array. Then 3 locks
have been added to protect, respectively, the fdtab array, the fd_cache array
and poll information. In addition, a lock for each entry in the fdtab array has
been added to protect all accesses to a specific FD or its information.
For pollers, according to the poller, the way to manage the concurrency is
different. There is a poller loop on each thread. So the set of monitored FDs
may need to be protected. epoll and kqueue are thread-safe per-se, so there few
things to do to protect these pollers. This is not possible with select and
poll, so there is no sharing between the threads. The poller on each thread is
independant from others.
Finally, per-thread init/deinit functions are used for each pollers and for FD
part for manage thread-local ressources.
Now, you must be carefull when a FD is created during the HAProxy startup. All
update on the FD state must be made in the threads context and never before
their creation. This is mandatory because fd_updt array is thread-local and
initialized only for threads. Because there is no pollers for the main one, this
array remains uninitialized in this context. For this reason, listeners are now
enabled in run_thread_poll_loop function, just like the worker pipe.
log buffers and static variables used in log functions are now thread-local. So
there is no need to lock anything to log messages. Moreover, per-thread
init/deinit functions are now used to initialize these buffers.
The function sync_poll_loop is called at the end of each loop inside
run_poll_loop function. It is a protected area where all threads have a chance
to execute tricky tasks with the warranty that no concurrent access is
possible. Of course, it comes with a cost because all threads must be
syncrhonized. So changes must be uncommon.
[WARNING] For now, HAProxy is not thread-safe, so from this commit, it will be
broken for a while, when compiled with threads.
When nbthread parameter is greater than 1, HAProxy will create the corresponding
number of threads. If nbthread is set to 1, nothing should be done. So if there
are concurrency issues (and be sure there will be, unfortunatly), an obvious
workaround is to disable the multithreading...
Each created threads will run a polling loop. So, in a certain way, it is pretty
similar to the nbproc mode ("outside" the bugs and the lock
contention). Nevertheless, there are an init and a deinit steps for each thread
to deal with per-thread allocation.
Each thread has a tid (thread-id), numbered from 0 to (nbtread-1). It is used in
many place to do bitwise operations or to improve debugging information.
A sync-point is a protected area where you have the warranty that no concurrency
access is possible. It is implementated as a thread barrier to enter in the
sync-point and another one to exit from it. Inside the sync-point, all threads
that must do some syncrhonous processing will be called one after the other
while all other threads will wait. All threads will then exit from the
sync-point at the same time.
A sync-point will be evaluated only when necessary because it is a costly
operation. To limit the waiting time of each threads, we must have a mechanism
to wakeup all threads. This is done with a pipe shared by all threads. By
writting in this pipe, we will interrupt all threads blocked on a poller. The
pipe is then flushed before exiting from the sync-point.
hap_register_per_thread_init and hap_register_per_thread_deinit functions has
been added to register functions to do, for each thread, respectively, some
initialization and deinitialization. These functions are added in the global
lists per_thread_init_list and per_thread_deinit_list.
These functions are called only when HAProxy is started with more than 1 thread
(global.nbthread > 1).
This file contains all functions and macros used to deal with concurrency in
HAProxy. It contains all high-level function to do atomic operation
(HA_ATOMIC_*). Note, for now, we rely on "__atomic" GCC builtins to do atomic
operation. So HAProxy can be compiled with the thread support iff these builtins
are available.
It also contains wrappers around plocks to use spin or read/write locks. These
wrappers are used to abstract the internal representation of the locking system
and to add information to help debugging, when compiled with suitable
options.
To add extra info on locks, you need to add DEBUG=-DDEBUG_THREAD or
DEBUG=-DDEBUG_FULL compilation option. In addition to timing info on locks, we
keep info on where a lock was acquired the last time (function name, file and
line). There are also the thread id and a flag to know if it is still locked or
not. This will be useful to debug deadlocks.
Because we can't always display the standard error messages when HAProxy is
started, all alerts and warnings emitted during the startup will now be saved in
a buffer. It can also be handy to store these messages just in case you
missed something during the startup
To implement this feature, Alert and Warning functions now relies on
display_message. The difference is just on conditions to call this function and
it remains unchanged. In display_message, if MODE_STARTING flag is set, we save
the message.
Now memprintf relies on memvprintf. This new function does exactly what
memprintf did before, but it must be called with a va_list instead of a variable
number of arguments. So there is no change for every functions using
memprintf. But it is now also possible to have same functionnality from any
function with variadic arguments.
Email alerts relies on checks to send emails. The link between a mailers section
and a proxy was resolved during the configuration parsing, But initialization was
done when the first alert is triggered. This implied memory allocations and
tasks creations. With this patch, everything is now initialized during the
configuration parsing. So when an alert is triggered, only the memory required
by this alert is dynamically allocated.
Moreover, alerts processing had a flaw. The task handler used to process alerts
to be sent to the same mailer, process_email_alert, was designed to give back
the control to the scheduler when an alert was sent. So there was a delay
between the sending of 2 consecutives alerts (the min of
"proxy->timeout.connect" and "mailer->timeout.mail"). To fix this problem, now,
we try to process as much queued alerts as possible when the task is woken up.
An email alert contains a list of tcpcheck_rule. Each one is dynamically
allocated, just like its internal members. So, when an email alerts is freed, we
must be sure to properly free each tcpcheck_rule too.
This patch must be backported in 1.7 and 1.6.
This is a huge patch with many changes, all about the DNS. Initially, the idea
was to update the DNS part to ease the threads support integration. But quickly,
I started to refactor some parts. And after several iterations, it was
impossible for me to commit the different parts atomically. So, instead of
adding tens of patches, often reworking the same parts, it was easier to merge
all my changes in a uniq patch. Here are all changes made on the DNS.
First, the DNS initialization has been refactored. The DNS configuration parsing
remains untouched, in cfgparse.c. But all checks have been moved in a post-check
callback. In the function dns_finalize_config, for each resolvers, the
nameservers configuration is tested and the task used to manage DNS resolutions
is created. The links between the backend's servers and the resolvers are also
created at this step. Here no connection are kept alive. So there is no needs
anymore to reopen them after HAProxy fork. Connections used to send DNS queries
will be opened on demand.
Then, the way DNS requesters are linked to a DNS resolution has been
reworked. The resolution used by a requester is now referenced into the
dns_requester structure and the resolution pointers in server and dns_srvrq
structures have been removed. wait and curr list of requesters, for a DNS
resolution, have been replaced by a uniq list. And Finally, the way a requester
is removed from a DNS resolution has been simplified. Now everything is done in
dns_unlink_resolution.
srv_set_fqdn function has been simplified. Now, there is only 1 way to set the
server's FQDN, independently it is done by the CLI or when a SRV record is
resolved.
The static DNS resolutions pool has been replaced by a dynamoc pool. The part
has been modified by Baptiste Assmann.
The way the DNS resolutions are triggered by the task or by a health-check has
been totally refactored. Now, all timeouts are respected. Especially
hold.valid. The default frequency to wake up a resolvers is now configurable
using "timeout resolve" parameter.
Now, as documented, as long as invalid repsonses are received, we really wait
all name servers responses before retrying.
As far as possible, resources allocated during DNS configuration parsing are
releases when HAProxy is shutdown.
Beside all these changes, the code has been cleaned to ease code review and the
doc has been updated.
The cli command to show resolvers stats is in conflict with the command to show
proxies and servers stats. When you use the command "show stat resolvers [id]",
instead of printing stats about resolvers, you get the stats about all proxies
and servers.
Now, to avoid conflict, to print resolvers stats, you must use the following
command:
show resolvers [id]
This patch must be backported in 1.7.
The messages processing is done using existing functions. So here, the main task
is to find the SPOE engine to use. To do so, we loop on all filter instances
attached to the stream. For each, we check if it is a SPOE filter and, if yes,
if its name is the one used to declare the "send-spoe-group" action.
We also take care to return an error if the action processing is interrupted by
HAProxy (because of a timeout or an error at the HAProxy level). This is done by
checking if the flag ACT_FLAG_FINAL is set.
The function spoe_send_group is the action_ptr callback ot
Because we can have messages chained by event or by group, we need to have a way
to know which kind of list we manipulate during the encoding. So 2 types of list
has been added, SPOE_MSGS_BY_EVENT and SPOE_MSGS_BY_GROUP. And the right type is
passed when spoe_encode_messages is called.
This action is used to trigger sending of a group of SPOE messages. To do so,
the SPOE engine used to send messages must be defined, as well as the SPOE group
to send. Of course, the SPOE engine must refer to an existing SPOE filter. If
not engine name is provided on the SPOE filter line, the SPOE agent name must be
used. For example:
http-request send-spoe-group my-engine some-group
This action is available for "tcp-request content", "tcp-response content",
"http-request" and "http-response" rulesets. It cannot be used for tcp
connection/session rulesets because actions for these rulesets cannot yield.
For now, the action keyword is parsed and checked. But it does nothing. Its
processing will be added in another patch.
For now, this section is only parsed. It should have the following format:
spoe-group <grp-name>
messages <msg-name> ...
And then SPOE groups must be referenced in spoe-agent section:
spoe-agnt <name>
...
groups <grp-name> ...
The purpose of these groups is to trigger messages sending from TCP or HTTP
rules, directly from HAProxy configuration, and not on specific event. This part
will be added in another patch.
It is important to note that a message belongs at most to a group.
The engine name is now kept in "spoe_config" struture. Because a SPOE filter can
be declared without engine name, we use the SPOE agent name by default. Then,
its uniqness is checked against all others SPOE engines configured for the same
proxy.
* TODO: Add documentation
Now, it is possible to conditionnaly send a SPOE message by adding an ACL-based
condition on the "event" line, in a "spoe-message" section. Here is the example
coming for the SPOE documentation:
spoe-message get-ip-reputation
args ip=src
event on-client-session if ! { src -f /etc/haproxy/whitelist.lst }
To avoid mixin with proxy's ACLs, each SPOE message has its private ACL list. It
possible to declare named ACLs in "spoe-message" section, using the same syntax
than for proxies. So we can rewrite the previous example to use a named ACL:
spoe-message get-ip-reputation
args ip=src
acl ip-whitelisted src -f /etc/haproxy/whitelist.lst
event on-client-session if ! ip-whitelisted
ACL-based conditions are executed in the context of the stream that handle the
client and the server connections.
"check_http_req_capture" and "check_http_res_capture" functions have been added
to check validity of "http-request capture" and "http-response capture"
rules. Code for these functions come from cfgparse.c.
SPOE filter can be declared without engine name. This is an optional
parameter. But in this case, no scope must be used in the SPOE configuration
file. So engine name and scope are both undefined, and, obviously, we must not
try to compare them.
This patch must be backported in 1.7.
It was painful not to have the status code available, especially when
it was computed. Let's store it and ensure we don't claim content-length
anymore on 1xx, only 0 body bytes.
This patch reorganize the shctx API in a generic storage API, separating
the shared SSL session handling from its core.
The shctx API only handles the generic data part, it does not know what
kind of data you use with it.
A shared_context is a storage structure allocated in a shared memory,
allowing its usage in a multithread or a multiprocess context.
The structure use 2 linked list, one containing the available blocks,
and another for the hot locked blocks. At initialization the available
list is filled with <maxblocks> blocks of size <blocksize>. An <extra>
space is initialized outside the list in case you need some specific
storage.
+-----------------------+--------+--------+--------+--------+----
| struct shared_context | extra | block1 | block2 | block3 | ...
+-----------------------+--------+--------+--------+--------+----
<-------- maxblocks --------->
* blocksize
The API allows to store content on several linked blocks. For example,
if you allocated blocks of 16 bytes, and you want to store an object of
60 bytes, the object will be allocated in a row of 4 blocks.
The API was made for LRU usage, each time you get an object, it pushes
the object at the end of the list. When it needs more space, it discards
The functions name have been renamed in a more logical way, the part
regarding shctx have been prefixed by shctx_ and the functions for the
shared ssl session cache have been prefixed by sh_ssl_sess_.
Move the ssl callback functions of the ssl shared session cache to
ssl_sock.c. The shctx functions still needs to be separated of the ssl
tree and data.
It's important for the H2 to H1 gateway that the response parser properly
clears the H1 message's body_len when seeing these status codes so that we
don't hang waiting to transfer data that will not come.
A bind_conf does contain a ssl_bind_conf, which already has a flag to know
if early data are activated, so use that, instead of adding a new flag in
the ssl_options field.
Add a new sample fetch, "ssl_fc_has_early", a boolean that will be true
if early data were sent, and a new action, "wait-for-handshake", if used,
the request won't be forwarded until the SSL handshake is done.
When compiled with Openssl >= 1.1.1, before attempting to do the handshake,
try to read any early data. If any early data is present, then we'll create
the session, read the data, and handle the request before we're doing the
handshake.
For this, we add a new connection flag, CO_FL_EARLY_SSL_HS, which is not
part of the CO_FL_HANDSHAKE set, allowing to proceed with a session even
before an SSL handshake is completed.
As early data do have security implication, we let the origin server know
the request comes from early data by adding the "Early-Data" header, as
specified in this draft from the HTTP working group :
https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-replay
Use Openssl-1.1.1 SSL_CTX_set_client_hello_cb to mimic BoringSSL early callback.
Native multi certificate and SSL/TLS method per certificate is now supported by
Openssl >= 1.1.1.
switchctx early callback is only supported for BoringSSL. To prepare
the support of openssl 1.1.1 early callback, convert CBS api to neutral
code to work with any ssl libs.
This patch simply brings HAProxy internal regex system to the Lua API.
Lua doesn't embed regexes, now it inherits from the regexes compiled
with haproxy.
Allow to register a function which will be called after the
configuration file parsing, at the end of the check_config_validity().
It's useful fo checking dependencies between sections or for resolving
keywords, pointers or values.
This commit implements a post section callback. This callback will be
used at the end of a section parsing.
Every call to cfg_register_section must be modified to use the new
prototype:
int cfg_register_section(char *section_name,
int (*section_parser)(const char *, int, char **, int),
int (*post_section_parser)());
Calls to build_logline() are audited in order to use dynamic trash buffers
allocated by alloc_trash_chunk() instead of global trash buffers.
This is similar to commits 07a0fec ("BUG/MEDIUM: http: Prevent
replace-header from overwriting a buffer") and 0d94576 ("BUG/MEDIUM: http:
prevent redirect from overwriting a buffer").
This patch should be backported in 1.7, 1.6 and 1.5. It relies on commit
b686afd ("MINOR: chunks: implement a simple dynamic allocator for trash
buffers") for the trash allocator, which has to be backported as well.
When switching the check code to a non-permanent connection, the new code
forgot to free the connection if an error happened and was returned by
connect_conn_chk(), leading to the check never be ran again.
Now when ssl_sock_{to,from}_buf are called, if the connection doesn't
feature CO_FL_WILL_UPDATE, they will first retrieve the updated flags
using conn_refresh_polling_flags() before changing any flag, then call
conn_cond_update_sock_polling() before leaving, to commit such changes.
Now when raw_sock_{to,from}_{pipe,buf} are called, if the connection
doesn't feature CO_FL_WILL_UPDATE, they will first retrieve the updated
flags using conn_refresh_polling_flags() before changing any flag, then
call conn_cond_update_sock_polling() before leaving, to commit such
changes. Note that the only real call to one of the __conn_* functions
is in fact in conn_sock_read0() which is called from here.
In transport-layer functions (snd_buf/rcv_buf), it's very problematic
never to know if polling changes made to the connection will be propagated
or not. This has led to some conn_cond_update_polling() calls being placed
at a few places to cover both the cases where the function is called from
the upper layer and when it's called from the lower layer. With the arrival
of the MUX, this becomes even more complicated, as the upper layer will not
have to manipulate anything from the connection layer directly and will not
have to push such updates directly either. But the snd_buf functions will
need to see their updates committed when called from upper layers.
The solution here is to introduce a connection flag set by the connection
handler (and possibly any other similar place) indicating that the caller
is committed to applying such changes on return. This way, the called
functions will be able to apply such changes by themselves before leaving
when the flag is not set, and the upper layer will not have to care about
that anymore.
SSL records are 16kB max. When trying to send larger data chunks at once,
SSL_read() only processes 16kB and ssl_sock_from_buf() believes it means
the system buffers are full, which is not the case, contrary to raw_sock.
This is particularly noticeable with HTTP/2 when using a 64kB buffer with
multiple streams, as the mux buffer can start to fill up pretty quickly
in this situation, slowing down the data delivery.
We've been keep this test for a connection being established since 1.5-dev14
when the stream-interface was still accessing the FD directly. The test on
CO_FL_HANDSHAKE and L{4,6}_CONN is totally useless here, and can even be
counter-productive on pure TCP where it could prevent a request from being
sent on a connection still attempting to complete its establishment. And it
creates an abnormal dependency between the layers that will complicate the
implementation of the mux, so let's get rid of it now.
This is based on the git SHA1 implementation and optimized to do word
accesses rather than byte accesses, and to avoid unnecessary copies into
the context array.
in 32af203b75 ("REORG: cli: move ssl CLI functions to ssl_sock.c")
"set ssl tls-key" was accidentally replaced with "set ssl tls-keys"
(keys instead of key). This is undocumented and breaks upgrades from
1.6 to 1.7.
This patch restores "set ssl tls-key" and also registers a helptext.
This should be backported to 1.7.
Commit 872085ce "BUG/MINOR: ssl: ocsp response with 'revoked' status is correct"
introduce a regression. OCSP_single_get0_status can return -1 and haproxy must
generate an error in this case.
Thanks to Sander Hoentjen who have spotted the regression.
This patch should be backported in 1.7, 1.6 and 1.5 if the patch above is
backported.
BoringSSL switch OPENSSL_VERSION_NUMBER to 1.1.0 for compatibility.
Fix BoringSSL call and openssl-compat.h/#define occordingly.
This will not break openssl/libressl compat.
On x86_64, when gcc instruments functions and compiles at -O0, it saves
the function's return value in register rbx before calling the trace
callback. It provides a nice opportunity to display certain useful
values (flags, booleans etc) during trace sessions. It's absolutely
not guaranteed that it will always work but it provides a considerable
help when it does so it's worth activating it. When building on a
different architecture, the value 0 is always reported as the return
value. On x86_64 with optimizations (-O), the RBX register will not
necessarily match and random values will be reported, but since it's
not the primary target it's not a problem.
Now any call to trace() in the code will automatically appear interleaved
with the call sequence and timestamped in the trace file. They appear with
a '#' on the 3rd argument (caller's pointer) in order to make them easy to
spot. If the trace functionality is not used, a dmumy weak function is used
instead so that it doesn't require to recompile every time traces are
enabled/disabled.
The trace decoder knows how to deal with these messages, detects them and
indents them similarly to the currently traced function. This can be used
to print function arguments for example.
Note that we systematically flush the log when calling trace() to ensure we
never miss important events, so this may impact performance.
The trace() function uses the same format as printf() so it should be easy
to setup during debugging sessions.
Don't forget to allocate tmptrash before using it, and free it once we're
done.
[wt: introduced by commit 64cc49cf ("MAJOR: servers: propagate server
status changes asynchronously"), no backport needed]
ocsp_status can be 'good', 'revoked', or 'unknown'. 'revoked' status
is a correct status and should not be dropped.
In case of certificate with OCSP must-stapling extension, response with
'revoked' status must be provided as well as 'good' status.
This patch can be backported in 1.7, 1.6 and 1.5.
Instead of having to manually handle lingering outside, let's make
conn_sock_shutw() check for it before calling shutdown(). We simply
don't want to emit the FIN if we're going to reset the connection
due to lingering. It's particularly important for silent-drop where
it's absolutely mandatory that no packet leaves the machine.
These flags are not exactly for the data layer, they instead indicate
what is expected from the transport layer. Since we're going to split
the connection between the transport and the data layers to insert a
mux layer, it's important to have a clear idea of what each layer does.
All function conn_data_* used to manipulate these flags were renamed to
conn_xprt_*.
The HTTP/2->HTTP/1 gateway will need to process HTTP/1 responses. We
cannot sanely rely on the HTTP/1 txn to parse a response because :
1) responses generated by haproxy such as error messages, redirects,
stats or Lua are neither parsed nor indexed ; this could be
addressed over the long term but will take time.
2) the http txn is useless to parse the body : the states present there
are only meaningful to received bytes (ie next bytes to parse) and
not at all to sent bytes. Thus chunks cannot be followed at all.
Even when implementing this later, it's unsure whether it will be
possible when dealing with compression.
So using the HTTP txn is now out of the equation and the only remaining
solution is to call an HTTP/1 message parser. We already have one, it was
slightly modified to avoid keeping states by benefitting from the fact
that the response was produced by haproxy and this is entirely available.
It assumes the following rules are true, or that incuring an extra cost
to work around them is acceptable :
- the response buffer is read-write and supports modifications in place
- headers sent through / by haproxy are not folded. Folding is still
implemented by replacing CR/LF/tabs/spaces with spaces if encountered
- HTTP/0.9 responses are never sent by haproxy and have never been
supported at all
- haproxy will not send partial responses, the whole headers block will
be sent at once ; this means that we don't need to keep expensive
states and can afford to restart the parsing from the beginning when
facing a partial response ;
- response is contiguous (does not wrap). This was already the case
with the original parser and ensures we can safely dereference all
fields with (ptr,len)
The parser replaces all of the http_msg fields that were necessary with
local variables. The parser is not called on an http_msg but on a string
with a start and an end. The HTTP/1 states were reused for ease of use,
though the request-specific ones have not been implemented for now. The
error position and error state are supported and optional ; these ones
may be used later for bug hunting.
The parser issues the list of all the headers into a caller-allocated
array of struct ist.
The content-length/transfer-encoding header are checked and the relevant
info fed the h1 message state (flags + body_len).
The chunk crlf parser used to depend on the channel and on the HTTP
message, eventhough it's not really needed. Let's remove this dependency
so that it can be used within the H2 to H1 gateway.
As part of this small API change, it was renamed to h1_skip_chunk_crlf()
to mention that it doesn't depend on http_msg anymore.
The chunk parser used to depend on the channel and on the HTTP message
but it's not really needed as they're only used to retrieve the buffer
as well as to return the number of bytes parsed and the chunk size.
Here instead we pass the (few) relevant information in arguments so that
the function may be reused without a channel nor an HTTP message (ie
from the H2 to H1 gateway).
As part of this API change, it was renamed to h1_parse_chunk_size() to
mention that it doesn't depend on http_msg anymore.
Functions http_parse_chunk_size(), http_skip_chunk_crlf() and
http_forward_trailers() were moved to h1.h and h1.c respectively so
that they can be called from outside. The parts that were inline
remained inline as it's critical for performance (+41% perf
difference reported in an earlier test). For now the "http_" prefix
remains in their name since they still depend on the http_msg type.
Certain types and enums are very specific to the HTTP/1 parser, and we'll
need to share them with the HTTP/2 to HTTP/1 translation code. Let's move
them to h1.c/h1.h. Those with very few occurrences or only used locally
were renamed to explicitly mention the relevant HTTP version :
enum ht_state -> h1_state.
http_msg_state_str -> h1_msg_state_str
HTTP_FLG_* -> H1_FLG_*
http_char_classes -> h1_char_classes
Others like HTTP_IS_*, HTTP_MSG_* are left to be done later.
Fix regression introduced by commit:
'MAJOR: servers: propagate server status changes asynchronously.'
The building of the log line was re-worked to be done at the
postponed point without lack of data.
[wt: this only affects 1.8-dev, no backport needed]
There's no point having the channel marked writable as these functions
only extract data from the channel. The code was retrieved from their
ci/co ancestors.
For HTTP/2 we'll need some buffer-only equivalent functions to some of
the ones applying to channels and still squatting the bi_* / bo_*
namespace. Since these names have kept being misleading for quite some
time now and are really getting annoying, it's time to rename them. This
commit will use "ci/co" as the prefix (for "channel in", "channel out")
instead of "bi/bo". The following ones were renamed :
bi_getblk_nc, bi_getline_nc, bi_putblk, bi_putchr,
bo_getblk, bo_getblk_nc, bo_getline, bo_getline_nc, bo_inject,
bi_putchk, bi_putstr, bo_getchr, bo_skip, bi_swpbuf
Since commit 'MAJOR: task: task scheduler rework'
0194897e54. LUA's
scheduling tasks are freezing.
A running task should not handle the scheduling itself
but let the task scheduler to handle it based on the
'expire' field.
[wt: no backport needed]
Clear MaxSslRate, SslFrontendMaxKeyRate and SslBackendMaxKeyRate when
clear counters is used, it was probably forgotten when those counters were
added.
[wt: this can probably be backported as far as 1.5 in dumpstats.c]
When the server weight is rised using the CLI, extra nodes have to be
allocated, or the weight will be effectively the same as the original one.
[wt: given that the doc made no explicit mention about this limitation,
this patch could even be backported as it fixes an unexpected behaviour]
Since around 1.5-dev12, we've been setting MSG_MORE on send() on various
conditions, including the fact that SHUTW_NOW is present, but we don't
check that it's accompanied with AUTO_CLOSE. The result is that on requests
immediately followed by a close (where AUTO_CLOSE is not set), the request
gets delayed in the TCP stack before being sent to the server. This is
visible with the H2 code where the end-of-stream flag is set on requests,
but probably happens when a POLL_HUP is detected along with the request.
The (lack of) presence of option abortonclose has no effect here since we
never send the SHUTW along with the request.
This fix can be backported to 1.7, 1.6 and 1.5.
The hour part of the timezone offset was multiplied by 60 instead of
3600, resulting in an inaccurate expiry. This bug was introduced in
1.6-dev1 by commit 4f3c87a ("BUG/MEDIUM: ssl: Fix to not serve expired
OCSP responses."), so this fix must be backported into 1.7 and 1.6.
In order to prepare multi-thread development, code was re-worked
to propagate changes asynchronoulsy.
Servers with pending status changes are registered in a list
and this one is processed and emptied only once 'run poll' loop.
Operational status changes are performed before administrative
status changes.
In a case of multiple operational status change or admin status
change in the same 'run poll' loop iteration, those changes are
merged to reach only the targeted status.
When using haproxy in front of distccd, it's possible to provide significant
improvements by only connecting when the preprocessing is completed, and by
selecting different farms depending on the payload size. This patch provides
two new sample fetch functions :
distcc_param(<token>[,<occ>]) : integer
distcc_body(<token>[,<occ>]) : binary
srv_queue([<backend>/]<server>) : integer
Returns an integer value corresponding to the number of connections currently
pending in the designated server's queue. If <backend> is omitted, then the
server is looked up in the current backend. It can sometimes be used together
with the "use-server" directive to force to use a known faster server when it
is not much loaded. See also the "srv_conn", "avg_queue" and "queue" sample
fetch methods.
Commit bcb86ab ("MINOR: session: add a streams field to the session
struct") added this list of streams that is not needed anymore. Let's
get rid of it now.
The warning appears when building with 51Degrees release that uses a new
Hash Trie algorithm (release version 3.2.12.12):
src/51d.c: In function init_51degrees:
src/51d.c:566:2: warning: enumeration value DATA_SET_INIT_STATUS_TOO_MANY_OPEN_FILES not handled in switch [-Wswitch]
switch (_51d_dataset_status) {
^
This patch can be backported in 1.7.
When
1) HAProxy configured to enable splice on both directions
2) After some high load, there are 2 input channels with their socket buffer
being non-empty and pipe being full at the same time, sitting in `fd_cache`
without any other fds.
The 2 channels will repeatedly be stopped for receiving (pipe full) and waken
for receiving (data in socket), thus getting out and in of `fd_cache`, making
their fd swapping location in `fd_cache`.
There is a `if (entry < fd_cache_num && fd_cache[entry] != fd) continue;`
statement in `fd_process_cached_events` to prevent frequent polling, but since
the only 2 fds are constantly swapping location, `fd_cache[entry] != fd` will
always hold true, thus HAProxy can't make any progress.
The root cause of the issue is dual :
- there is a single fd_cache, for next events and for the ones being
processed, while using two distinct arrays would avoid the problem.
- the write side of the stream interface wakes the read side up even
when it couldn't write, and this one really is a bug.
Due to CF_WRITE_PARTIAL not being cleared during fast forwarding, a failed
send() attempt will still cause ->chk_rcv() to be called on the other side,
re-creating an entry for its connection fd in the cache, causing the same
sequence to be repeated indefinitely without any opportunity to make progress.
CF_WRITE_PARTIAL used to be used for what is present in these tests : check
if a recent write operation was performed. It's part of the CF_WRITE_ACTIVITY
set and is tested to check if timeouts need to be updated. It's also used to
detect if a failed connect() may be retried.
What this patch does is use CF_WROTE_DATA() to check for a successful write
for connection retransmits, and to clear CF_WRITE_PARTIAL before preparing
to send in stream_int_notify(). This way, timeouts are still updated each
time a write succeeds, but chk_rcv() won't be called anymore after a failed
write.
It seems the fix is required all the way down to 1.5.
Without this patch, the only workaround at this point is to disable splicing
in at least one direction. Strictly speaking, splicing is not absolutely
required, as regular forwarding could theorically cause the issue to happen
if the timing is appropriate, but in practice it appears impossible to
reproduce it without splicing, and even with splicing it may vary.
The following config manages to reproduce it after a few attempts (haproxy
going 100% CPU and having to be killed) :
global
maxpipes 50000
maxconn 10000
listen srv1
option splice-request
option splice-response
bind :8001
server s1 127.0.0.1:8002
server$ tcploop 8002 L N20 A R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000 R10 S1000000
client$ tcploop 8001 N20 C T S1000000 R10 J
url_dec sample converter uses url_decode function to decode an URL. This
function fails by returning -1 when an invalid character is found. But the
sample converter never checked the return value and it used it as length for the
decoded string. Because it always succeeded, the invalid sample (with a string
length set to -1) could be used by other sample fetches or sample converters,
leading to undefined behavior like segfault.
The fix is pretty simple, url_dec sample converter just needs to return an error
when url_decode fails.
This patch must be backported in 1.7 and 1.6.
I misplaced the "if (!fdt.owner)" test so it can occasionally crash
when dumping an fd that's already been closed but still appears in
the table. It's not critical since this was not pushed into any
release nor backported though.
Health check currently cheat, they allocate a connection upon startup and never
release it, it's only recycled. The problem with doing this is that this code
is preventing the connection code from evolving towards multiplexing.
This code ensures that it's safe for the checks to run without a connection
all the time. Given that the code heavily relies on CO_FL_ERROR to signal
check errors, it is not trivial but in practice this is the principle adopted
here :
- the connection is not allocated anymore on startup
- new checks are not supposed to have a connection, so an attempt is made
to allocate this connection in the check task's context. If it fails,
the check is aborted on a resource error, and the rare code on this path
verifying the connection was adjusted to check for its existence (in
practice, avoid to close it)
- returning checks necessarily have a valid connection (which may possibly
be closed).
- a "tcp-check connect" rule tries to allocate a new connection before
releasing the previous one (but after closing it), so that if it fails,
it still keeps the previous connection in a closed state. This ensures
a connection is always valid here
Now it works well on all tested cases (regular and TCP checks, even with
multiple reconnections), including when the connection is forced to NULL or
randomly allocated.
The tcp-checks are very fragile. They can modify a connection's FD by
closing and reopening a socket without informing the connection layer,
which may then possibly touch the wrong fd. Given that the events are
only cleared and that the fd is just created, there should be no visible
side effect because the old fd is deleted so even if its flags get cleared
they were already, and the new fd already has them cleared as well so it's
a NOP. Regardless, this is too fragile and will not resist to threads.
In order to address this situation, this patch makes tcpcheck_main()
indicate if it closed a connection and report it to wake_srv_chk(), which
will then report it to the connection's fd handler so that it refrains
from updating the connection polling and the fd. Instead the connection
polling status is updated in the wake() function.
When tcp-checks are in use, a connection starts to be created, then it's
destroyed so that tcp-check can recreate its own. Now we directly move
to tcpcheck_main() when it's detected that tcp-check is in use.
The following config :
backend tcp9000
option tcp-check
tcp-check comment "this is a comment"
tcp-check connect port 10000
server srv 127.0.0.1:9000 check inter 1s
will result in a connection being first made to port 9000 then immediately
destroyed and re-created on port 10000, because the first rule is a comment
and doesn't match the test for the first rule being a connect(). It's
mostly harmless (unless the server really must not receive empty
connections) and the workaround simply consists in removing the comment.
Let's proceed like in other places where we simply skip leading comments.
A new function was made to make this lookup les boring. The fix should be
backported to 1.7 and 1.6.
Since this connection is not used at all anymore, do not allocate it.
It was verified that check successes and failures (both synchronous
and asynchronous) continue to be properly reported.
Upon fork() error, a first report is immediately made by connect_proc_chk()
via set_server_check_status(), then process_chk_proc() detects the error
code and makes up a dummy connection error to call chk_report_conn_err(),
which tries to retrieve the errno code from the connection, fails, then
saves the status message from the check, fails all "if" tests on its path
related to the connection then resets the check's state to the current one
with the current status message. All this useless chain is the only reason
why process checks require a connection! Let's simply get rid of this second
useless call.
The external process check code abused a little bit from copy-pasting to
the point of making think it requires a connection... The initialization
code only returns SF_ERR_NONE and SF_ERR_RESOURCE, so the other one can
be folded there. The code now only uses the connection to report the
error status.
Amazingly, this function takes a connection to report an error and is used
by process checks, placing a hard dependency between the connection and the
check preventing the mux from being completely implemented. Let's first get
rid of this.
A config containing "stats socket /path/to/socket mode admin" used to
silently start and be unusable (mode 0, level user) because the "mode"
parser doesn't take care of non-digits. Now it properly reports :
[ALERT] 276/144303 (7019) : parsing [ext-check.cfg:4] : 'stats socket' : ''mode' : missing or invalid mode 'admin' (octal integer expected)'
This can probably be backported to 1.7, 1.6 and 1.5, though reporting
parsing errors in very old versions probably isn't a good idea if the
feature was left unused for years.
This function can destroy a socket and create a new one, resulting in a
change of FD on the connection between recv() and send() for example,
which is absolutely not permitted, and can result in various funny games
like polling not being properly updated (or with the flags from a previous
fd) etc.
Let's only call this from the wake() callback which is more tolerant.
Ideally the operations should be made even more reliable by returning
a specific value to indicate that the connection was released and that
another one was created. But this is hasardous for stable releases as
it may reveal other issues.
This fix should be backported to 1.7 and 1.6.
In the rare case where the "tcp-check send" directive is the last one in
the list, it leaves the loop without sending the data. Fortunately, the
polling is still enabled on output, resulting in the connection handler
calling back to send what remains, but this is ugly and not very reliable.
This may be backported to 1.7 and 1.6.
While porting the connection to use the mux layer, it appeared that
tcp-checks wouldn't receive anymore because the polling is not enabled
before attempting to call xprt->rcv_buf() nor xprt->snd_buf(), and it
is illegal to call these functions with polling disabled as they
directly manipulate the FD state, resulting in an inconsistency where
the FD is enabled and the connection's polling flags disabled.
Till now it happened to work only because when recv() fails on EAGAIN
it calls fd_cant_recv() which enables polling while signaling the
failure, so that next time the message is received. But the connection's
polling is never enabled, and any tiny change resulting in a call to
conn_data_update_polling() immediately disables reading again.
It's likely that this problem already happens on some corner cases
such as multi-packet responses. It definitely breaks as soon as the
response buffer is full but we don't support consuming more than one
response buffer.
This fix should be backported to 1.7 and 1.6. In order to check for the
proper behaviour, this tcp-check must work and clearly show an SSH
banner in recvfrom() as observed under strace, otherwise it's broken :
tcp-check connect port 22
tcp-check expect rstring SSH
tcp-check send blah
This used to be needed to know whether there was a check in progress a
long time ago (before tcp_checks) but this is not true anymore and even
becomes wrong after the check is reused as conn_init() initializes it
to DEAD_FD_MAGIC.
A regression has been introduced in commit
00005ce5a1: the port being changed is the
one from 'cli_conn->addr.from' instead of 'cli_conn->addr.to'.
This patch fixes the regression.
Backport status: should be backported to HAProxy 1.7 and above.
Leaving the maintenance state and if the server remains in stopping mode due
to a tracked one:
- We mistakenly try to grab some pending conns and shutdown backup sessions.
- The proxy down time and last change were also mistakenly updated
This patch adds the ability to read from a wrapping memory area (ie:
buffers). The new functions are called "readv_<type>". The original
ones were renamed to start with "read_" to make the difference more
obvious between the read method and the returned type.
It's worth noting that the memory barrier in readv_bytes() is critical,
as otherwise gcc decides that it doesn't need the resulting data, but
even worse, removes the length checks in readv_u64() and happily
performs an out-of-bounds unaligned read using read_u64()! Such
"optimizations" are a bit borderline, especially when they impact
security like this...
snr_resolution_cb can be called with <nameserver> parameter set to NULL. So we
must check it before using it. This is done most of time, except when we deal
with invalid DNS response.
The check was totally messed up. In the worse case, it led to a crash, when
res.comp_algo sample fetch was retrieved on uncompressed response (with the
compression enabled).
This patch must be backported in 1.7.
There are several places where we see feconn++, feconn--, totalconn++ and
an increment on the frontend's number of connections and connection rate.
This is done exactly once per session in each direction, so better take
care of this counter in the session and simplify the callers. At least it
ensures a better symmetry. It also ensures consistency as till now the
lua/spoe/peers frontend didn't have these counters properly set, which can
be useful at least for troubleshooting.
session_accept_fd() may either successfully complete a session creation,
or defer it to conn_complete_session() depending of whether a handshake
remains to be performed or not. The problem is that all the code after
the handshake was duplicated between the two functions.
This patch make session_accept_fd() synchronously call
conn_complete_session() to finish the session creation. It is only needed
to check if the session's task has to be released or not at the end, which
is fairly minimal. This way there is now a single place where the sessions
are created.
Commit 8e3c6ce ("MEDIUM: connection: get rid of data->init() which was
not for data") simplified conn_complete_session() but introduced a
confusing check which cannot happen on CO_FL_HANDSHAKE. Make it clear
that this call is final and will either succeed and complete the
session or fail.
Instead of duplicating some sensitive listener-specific code in the
session and in the stream code, let's call listener_release() when
releasing a connection attached to a listener.
Each user of a session increments/decrements the jobs variable at its
own place, resulting in a real mess and inconsistencies between them.
Let's have session_new() increment jobs and session_free() decrement
it.
Some places call delete_listener() then decrement the number of
listeners and jobs. At least one other place calls delete_listener()
without doing so, but since it's in deinit(), it's harmless and cannot
risk to cause zombie processes to survive. Given that the number of
listeners and jobs is incremented when creating the listeners, it's
much more logical to symmetrically decrement them when deleting such
listeners.
This function is used to create a series of listeners for a specific
address and a port range. It automatically calls the matching protocol
handlers to add them to the relevant lists. This way cfgparse doesn't
need to manipulate listeners anymore. As an added bonus, the memory
allocation is checked.
Since everything is self contained in proto_uxst.c there's no need to
export anything. The same should be done for proto_tcp.c but the file
contains other stuff that's not related to the TCP protocol itself
and which should first be moved somewhere else.
cfgparse has no business directly calling each individual protocol's 'add'
function to create a listener. Now that they're all registered, better
perform a protocol lookup on the family and have a standard ->add method
for all of them.
It's a shame that cfgparse() has to make special cases of each protocol
just to cast the port to the target address family. Let's pass the port
in argument to the function. The unix listener simply ignores it.
It's pointless to read it on each and every accept(), as we only need
it for reporting in debugging mode a few lines later. Let's move this
part to the relevant block.
Since v1.7 it's pointless to reference a listener when greating a session
for an outgoing connection, it only complicates the code. SPOE and Lua were
cleaned up in 1.8-dev1 but the peers code was forgotten. This patch fixes
this by not assigning such a listener for outgoing connections. It also has
the extra benefit of not discounting the outgoing connections from the number
of allowed incoming connections (the code currently adds a safety marging of
3 extra connections to take care of this).
Adds cli commands to change at runtime whether informational messages
are prepended with severity level or not, with support for numeric and
worded severity in line with syslog severity level.
Adds stats socket config keyword severity-output to set default behavior
per socket on startup.
These notification management function and structs are generic and
it will be better to move in common parts.
The notification management functions and structs have names
containing some "lua" references because it was written for
the Lua. This patch removes also these references.
When we try to access to other proxy context, we must check
its existence because haproxy can kill it between the creation
and the usage.
This patch should be backported in 1.6 and 1.7
A previous fix was made to prevent the connection to a server if a redirect was
performed during the request processing when we wait to keep the client
connection alive. This fix introduced a pernicious bug. If a client closes its
connection immediately after sending a request, it is possible to keep stream
alive infinitely. This happens when the connection closure is caught when the
request is received, before the request parsing.
To be more specific, this happens because the close event is not "forwarded",
first because of the call to "channel_dont_connect" in the function
"http_apply_redirect_rule", then because we want to keep the client connection
alive, we explicitly call "channel_dont_close" in the function
"http_request_forward_body".
So, to fix the bug, instead of blocking the server connection, we force its
shutdown. This will force the stream to re-evaluate all connexions states. So it
will detect the client has closed its connection.
This patch must be backported in 1.7.
smp_fetch_ssl_fc_cl_str as very limited usage (only work with openssl == 1.0.2
compiled with the option enable-ssl-trace). It use internal cipher.algorithm_ssl
attribut and SSL_CIPHER_standard_name (available with ssl-trace).
This patch implement this (debug) function in a standard way. It used common
SSL_CIPHER_get_name to display cipher name. It work with openssl >= 1.0.2
and boringssl.
This reverts commit 19e8aa58f7.
It causes some trouble reported by Manu :
listen tls
[...]
server bla 127.0.0.1:8080
[ALERT] 248/130258 (21960) : parsing [/etc/haproxy/test.cfg:53] : 'server bla' : no method found to resolve address '(null)'
[ALERT] 248/130258 (21960) : Failed to initialize server(s) addr.
According to Nenad :
"It's not a good way to fix the issue we were experiencing
before. It will need a bigger rewrite, because the logic in
srv_iterate_initaddr needs to be changed."
Historically the DNS was the only way of updating the server IP dynamically
and the init-addr processing and state file load required the server to have
an FQDN defined. Given that we can now update the IP through the socket as
well and also can have different init-addr values (like IP and 'none') - this
requirement needs to be removed.
This patch should be backported to 1.7.
Since commit 5be2f35 ("MAJOR: polling: centralize calls to I/O callbacks")
that came into 1.6-dev1, each poller deals with its own events and decides
to signal ability to receive or send on a file descriptor based on the
active events on the file descriptor.
The commit above was incorrectly done for the epoll code. Instead of
checking the active events on the fd, it checks for the new events. In
general these ones are the same for POLL_IN and POLL_OUT since they
are always cleared prior to being computed, but it is possible that
POLL_HUP and POLL_ERR were initially reported and are not reported
again (especially for HUP). This could happen for example if POLL_HUP
and POLL_IN were received together, the pending data exactly correspond
to a full buffer which is read at once, preventing the POLL_HUP from
being dealt with in the same call, and on the next call only POLL_OUT
is reported (eg: to emit some response or peers protocol ACKs). In this
case fd_may_recv() will not be enabled anymore and the close event will
be missed.
It seems quite hard to trigger this case, though it might explain some
of the rare missed close events that were detected in the past on the
peers.
This fix needs to be backported to 1.6 and 1.7.
The server state and weight was reworked to handle
"pending" values updated by checks/CLI/LUA/agent.
These values are commited to be propagated to the
LB stack.
In further dev related to multi-thread, the commit
will be handled into a sync point.
Pending values are named using the prefix 'next_'
Current values used by the LB stack are named 'cur_'
This string is used in sample fetches so it is safe to use a preallocated trash
chunk instead of a buffer dynamically allocated during HAProxy startup.
First, this variable does not need to be publicly exposed because it is only
used by stick_table functions. So we declare it as a global static in
stick_table.c file. Then, it is useless to use a pointer. Using a plain struct
variable avoids any dynamic allocation.
swap_buffer is a global variable only used by buffer_slow_realign. So it has
been moved from global.h to buffer.c and it is allocated by init_buffer
function. deinit_buffer function has been added to release it. It is also used
to destroy the buffers' pool.
During the configuration parsing, log buffers are reallocated when
global.max_syslog_len is updated. This can be done serveral time. So, instead of
doing it serveral time, we do it only once after the configuration parsing.
Now, we use init_log_buffers and deinit_log_buffers to, respectively, initialize
and deinitialize log buffers used for syslog messages.
These functions have been introduced to be used by threads, to deal with
thread-local log buffers.
Trash buffers are reallocated when "tune.bufsize" parameter is changed. Here, we
just move the realloc after the configuration parsing.
Given that the config parser doesn't rely on the trash size, it should be
harmless.
Now, we use init_trash_buffers and deinit_trash_buffers to, respectively,
initialize and deinitialize trash buffers (trash, trash_buf1 and trash_buf2).
These functions have been introduced to be used by threads, to deal with
thread-local trash buffers.