process_runnable_tasks() needs to requeue or wake up tasks after
processing them in batches. By only refilling the existing ones, we
avoid revisiting all the queue. The performance gain is measurable
starting with two threads, where the request rate climbs to 657k/s
compared to 644k.
This patch slightly rearranges the loop to pack the locked code a little
bit, and to try to concentrate accesses to the tree together to benefit
more from the cache.
It also fixes how the loop handles the right margin : now that is guaranteed
that the retrieved nodes are filtered to only match the current thread, we
don't need to rewind every 16 entries. Instead we can rewind each time we
reach the right margin again.
With this change, we now achieve the following performance for 10 H2 conns
each containing 100 streams :
1 thread : 550kreq/s
2 thread : 644kreq/s
3 thread : 598kreq/s
This function is sensitive, let's make it shorter by factoring out the
unlock and leave code. This reduced the function's size by a few tens
of bytes and increased the overall performance by about 1%.
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.
This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.
This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.
Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.
Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :
Before After Gain
test duration : 89.6s 4.73s x19
HTTP requests/s (DEBUG) : 11200 211300 x19
HTTP requests/s (PROD) : 15900 447000 x28
spin_lock time : 85.2s 0.46s /185
time per lookup : 13us 40ns /325
Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.
A test showed that there's no benefit in using this for the wait queue
though.
The first pid in the pidfile is now the parent, it's more convenient for
supervising the processus.
You can now reload haproxy in master-worker mode with convenient command
like: kill -USR2 $(head -1 /tmp/haproxy.pid)
Commit 0493149 ("MINOR: thread: report multi-thread support in haproxy -vv")
added information about thread support in haproxy -vv output but accidently
marked the message as "must_free" while it's a constant. This causes a segv
on the old process on clean exit if threads are enabled. It doesn't affect
the stability during operations however.
unbind_listener() takes the listener lock, which is already held by
enable_listener(). This situation happens when starting with nbproc > 1
with some bind lines limited to a certain process, because in this case
enable_listener() tries to stop unneeded listeners.
This commit introduces __do_unbind_listeners() which must be called with
the lock held, and makes enable_listener() use this one. Given that the
only return code has never been used and that it starts to make the code
more complicated to propagate it before throwing it to the trash, the
function's return type was changed to void.
The spin_unlock() was called just before setting the expiry to
TICK_ETERNITY, so if another thread has the time to perform its
update and set a timeout, this would would clear it.
Commit c3680ec ("MINOR: add severity information to cli feedback messages")
introduced a severity level to CLI messages, but one of them was missed
on "set server addr". No backport is needed.
Given that all spinning loops we've had since 1.8-rc1 were caused by
unbalanced lock/unlock, let's get rid of all return statements in the
locked check functions and only exit via a a single unlock place.
The "set server <srv> check-port" CLI handler forgot to return after
detecting an error on the port number, and still proceeds with the action.
This needs to be backported to 1.7.
Some unlocks were missing, resulting in deadlocks even with a single thread.
We really need to make these functions safer by getting rid of all those
remaining "return" calls and only leave using a goto!
Using peers or stick table we could update an freq_ctr
using a tick value with the first bit set but this
bit is reserved for lock since multithreading support.
Fix bugs due to missing unlock and recursive lock performing
http health check.
The server's lock scope was enlarged to protect all callers
of 'set_server_check_status' and 'chk_report_conn_err'.
This fix also protects tcpcheck against concurrency.
Before introducing the mux layer, tcp_connect() would poll for sending
to detect the connection establishment. It happens that the health
checks have apparently never explicitly enabled this polling and have
been relying on this implicit one.
Now that there's the mux layer, the conn_stream needs to be enabled
for polling as well and since it's not done in the checks, it's never
done and the check's request doesn't leave the machine, as can be
noticed with http checks.
The solution simply consists in going back to the well-known case
where we enable polling after connecting using cs_want_send() if we
have anything but just a plain connect(). The regular data path is
not affected because the stream interface code automatically computes
the polling needs based on buffer contents.
This situation which must not happen does in fact happen when feeding
artificial responses using errorfiles, Lua or an applet. For now it
causes the H1 response parser to loop forever trying to get a more
complete response. Since it cannot progress, let's return an error.
Don't bother testing if len is nonzero, we know it is, as we're in the
"else" part of a if (!len), and testing it confuses clang into thinking
ret may be left uninitialized.
Store object in the cache. The cache use an shctx for storage.
It uses an http-response action to store the headers and a filter to
store the body. The http-response action is used in order to allow
modifications by other actions before caching.
Forbid shctx to read more than expected, it allows you to use a greater
value as a len with shctx_row_data_get(), the size of the destination
buffer for example.
Previous commit ea3928 (MEDIUM: h2: apply a timeout to h2 connections)
was wrong for two reasons. The first one is that if the client timeout
is not set, it's used as zero, preventing connections from establishing.
The second reason is that if the timeout triggers with active streams
(normally it should not since the task is supposed to be disabled), the
task is removed (h2c->task=NULL), and the last quitting stream might
try to dereference it.
Instead of doing this, we simply not register the task if there's no
timeout (it's useless) and we always control its presence in the streams.
Till now there was no way to deal with a dead H2 connection. Now each
connection creates a task that wakes up to kill the connection. Its
timeout is constantly refreshed when there's some activity. In case
the timeout triggers, the best effort attempts are made at sending a
clean GOAWAY message before closing and signaling the streams.
The timeout is automatically disabled when there's an active stream on
the connection, and restarted when the last stream finishes. This way
it should not affect long sessions.
Given that we're processing data produced by haproxy, we know that the
situations where haproxy doesn't return anything are :
- request timeout with option http-ignore-probes : there's no reason to
hit this since we're creating the stream with the request into it ;
- tcp-request content reject : this definitely means we want to kill the
connection and abort keep-alive and any further processing ;
- using /dev/null as the error file to hide an error
In practice it appears that using the abort on empty response as a hint to
trigger a connection close is very appropriate to continue to give the
control over the connection management. This patch thus tries to send a
GOAWAY frame with the max_id presented as the last stream ID, then sends
an RST_STREAM for the current stream. For the client, this means that the
connection must be shut down immediately after processing the last pending
streams and that the current stream is aborted. This way it's still possible
to force connections to be closed using tcp-request rules.
After some long brainstorming sessions, it appears that "Connection: close"
seems to be the best signal from the L7 layer to indicate the need to close
the connection. Indeed, in H1 it is only present in very rare cases (eg:
certain unrecoverable errors, some of which could remove it now by the way).
It will also be added when the L7 layer wants to force the connection to
terminate. By default when running in keep-alive mode it is not present.
It's worth mentionning that in H1 with persistent connections, we have sort
of a concurrency-1 mux and this header field is used the same way.
Thus here this patch detects "Connection: close" in response headers and
if seen, sends a GOAWAY frame with the highest possible ID so that the
client knows that it can quit whenever it wants to. If more aggressive
closures are needed in the future, we may decide to advertise the max_id
to abort after the current requests and better honor "http-request deny".
For a graceful shutdown, the specs requries to discard frames with a
stream ID higher than the advertised last_id. (RFC7540#6.8). Well,
finally for now the code is disabled (see last page of #6.8). Some
frames need to be processed anyway to maintain the compression state
and the flow control window state, but we don't have any trivial way
to do this and ignore them at the same time. For the headers it's
the worst case where we can't parse headers frames without coming
from the streams, and we don't want to create such streams as we'd
have to abort them, and aborting would cause errors to flow back.
Possibly that a longterm solution might involve using some dummy
streams and dummy buffers for this and calling the parsers directly.
RFC7540#5.1 is pretty clear : "any frame other than WINDOW_UPDATE,
PRIORITY, or RST_STREAM in this state MUST be treated as a connection
error of type STREAM_CLOSED". Instead of dealing with this for each
and every frame type, let's do it once for all in the main demux loop.
RFC7540#5.1 is pretty clear : "any frame other than HEADERS or PRIORITY
in this state MUST be treated as a connection error". Instead of dealing
with this for each and every frame type, let's do it once for all in the
main demux loop.