3091 Commits

Author SHA1 Message Date
Willy Tarreau
8b94969054 MINOR: fd: cache-align fdtab and fdcache locks
These locks are highly contended, let's not make them share cache lines.
2017-11-26 11:10:51 +01:00
Willy Tarreau
53bae85b8e BUG/MINOR: threads: don't drop "extern" on the lock in include files
Commit 9dcf9b6 ("MINOR: threads: Use __decl_hathreads to declare locks")
accidently lost a few "extern" in certain lock declarations, possibly
causing certain entries to be declared at multiple places. Apparently
it hasn't caused any harm though.

The offending ones were :
  - fdtab_lock
  - fdcache_lock
  - poll_lock
  - buffer_wq_lock
2017-11-26 11:10:50 +01:00
William Lallemand
4cfede87a3 MAJOR: mworker: exits the master on failure
This patch changes the behavior of the master during the exit of a
worker.

When a worker exits with an error code, for example in the case of a
segfault, all workers are now killed and the master leaves.

If you don't want this behavior you can use the option
"master-worker no-exit-on-failure".
2017-11-24 22:48:27 +01:00
Willy Tarreau
bafbe01028 CLEANUP: pools: rename all pool functions and pointers to remove this "2"
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
2017-11-24 17:49:53 +01:00
Olivier Houchard
fbc74e8556 MINOR/CLEANUP: proxy: rename "proxy" to "proxies_list"
Rename the global variable "proxy" to "proxies_list".
There's been multiple proxies in haproxy for quite some time, and "proxy"
is a potential source of bugs, a number of functions have a "proxy" argument,
and some code used "proxy" when it really meant "px" or "curproxy". It worked
by pure luck, because it usually happened while parsing the config, and thus
"proxy" pointed to the currently parsed proxy, but we should probably not
rely on this.

[wt: some of these are definitely fixes that are worth backporting]
2017-11-24 17:21:27 +01:00
Christopher Faulet
767a84bcc0 CLEANUP: log: Rename Alert/Warning in ha_alert/ha_warning 2017-11-24 17:19:12 +01:00
Christopher Faulet
c644fa9bf5 MINOR: config: Add threads support for "process" option on "bind" lines
It is now possible on a "bind" line (or a "stats socket" line) to specify the
thread set allowed to process listener's connections. For instance:

    # HTTPS connections will be processed by all threads but the first and HTTP
    # connection will be processed on the first thread.
    bind *:80 process 1/1
    bind *:443 ssl crt mycert.pem process 1/2-
2017-11-24 15:38:50 +01:00
Christopher Faulet
cb6a94510d MINOR: config: Add the threads support in cpu-map directive
Now, it is possible to bind CPU at the thread level instead of the process level
by defining a thread set in "cpu-map" directives. Thus, its format is now:

  cpu-map [auto:]<process-set>[/<thread-set>] <cpu-set>...

where <process-set> and <thread-set> must follow the format:

  all | odd | even | number[-[number]]

Having a process range and a thread range in same time with the "auto:" prefix
is not supported. Only one range is supported, the other one must be a fixed
number. But it is allowed when there is no "auto:" prefix.

Because it is possible to define a mapping for a process and another for a
thread on this process, threads will be bound on the intersection of their
mapping and the one of the process on which they are attached. If the
intersection is null, no specific binding will be set for the threads.
2017-11-24 15:38:50 +01:00
Christopher Faulet
26028f6209 MINOR: config: Add auto-increment feature for cpu-map
The prefix "auto:" can be added before the process set to let HAProxy
automatically bind a process to a CPU by incrementing process and CPU sets. To
be valid, both sets must have the same size. No matter the declaration order of
the CPU sets, it will be bound from the lower to the higher bound.

  Examples:
      # all these lines bind the process 1 to the cpu 0, the process 2 to cpu 1
      #  and so on.
      cpu-map auto:1-4   0-3
      cpu-map auto:1-4   0-1 2-3
      cpu-map auto:1-4   3 2 1 0

      # bind each process to exaclty one CPU using all/odd/even keyword
      cpu-map auto:all   0-63
      cpu-map auto:even  0-31
      cpu-map auto:odd   32-63

      # invalid cpu-map because process and CPU sets have different sizes.
      cpu-map auto:1-4   0    # invalid
      cpu-map auto:1     0-3  # invalid
2017-11-24 15:38:49 +01:00
Christopher Faulet
ff8131861f MINOR: standard: Add my_ffsl function to get the position of the bit set to one 2017-11-24 15:38:49 +01:00
Christopher Faulet
f1f0c5f591 MINOR: config: Export parse_process_number and use it wherever it's applicable
This function is used when "bind-process" directive is parsed and when "process"
parameter on a "bind" or a "stats socket" line is parsed.
2017-11-24 15:38:49 +01:00
William Lallemand
f528fff46b MEDIUM: cache: store sha1 for hashing the cache key
The cache was relying on the txn->uri for creating its key, which was a
big problem when there was no log activated.

This patch does a sha1 of the host + uri, and stores it in the txn.
When a object is stored, the eb32node uses the first 32 bits of the hash
as a key, and the whole hash is stored in the cache entry.

During a lookup, the truncated hash is used, and when it matches an
entry we check the real sha1.
2017-11-23 20:20:04 +01:00
Olivier Houchard
90084a133d MINOR: ssl: Handle reading early data after writing better.
It can happen that we want to read early data, write some, and then continue
reading them.
To do so, we can't reuse tmp_early_data to store the amount of data sent,
so introduce a new member.
If we read early data, then ssl_sock_to_buf() is now the only responsible
for getting back to the handshake, to make sure we don't miss any early data.
2017-11-23 19:35:28 +01:00
Willy Tarreau
158fa75811 MINOR: pools: implement DEBUG_UAF to detect use after free
This code has been used successfully a few times in the past to detect
that a pool was used after being freed. Its main goal is to allocate a
full page for each object so that they are always released individually
and unmapped from memory. This way if any part of the code reference the
object after is was freed and before it is reallocated, a segv occurs at
the exact offending location. It does a few extra things such as writing
to the memory area before freeing to detect double-frees and free of
read-only areas, and placing the data at the end of the page instead of
the beginning so that out of bounds accesses are easier to spot. The
amount of memory used with this is huge (about 10 times the regular
usage) but it can be useful sometimes.
2017-11-22 19:43:57 +01:00
Willy Tarreau
f13322ede1 MINOR: pools: prepare functions to override malloc/free in pools
This will be useful to add some debugging capabilities. For now it
changes nothing.
2017-11-22 19:27:44 +01:00
William Lallemand
111bfef33c MEDIUM: shctx: use unsigned int for len and block_count
Allows bigger objects to be cached in the shctx, the first
implementation was only storing small ssl session, but we want to store
bigger HTTP response.
2017-11-21 21:35:04 +01:00
Willy Tarreau
59a10fb53d MEDIUM: h2: change hpack_decode_headers() to only provide a list of headers
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.

This commit modifies the headers decoding so that it now works in two
phases : hpack_decode_headers() only decodes the HPACK stream in the
HEADERS frame and puts the result into a list. Headers which require
storage (huffman-compressed or from the dynamic table) are stored in
a chunk allocated by the H2 demuxer. Then once the headers are properly
decoded into this list, h2_make_h1_request() is called with this list
to produce the HTTP/1.1 request into the destination buffer. The list
necessarily enforces a limit. Here we use 2*MAX_HTTP_HDR, which means
that we can have as many individual cookies as we have regular headers
if a client decides to break their cookies into multiple values. This
seams reasonable and will allow the H1 parser to decide whether it's
too much or not.

Thus the output stream is not produced on the fly anymore and this will
permit to deal with certain corner cases like reparing the Cookie header
(which for now is not done).

In order to limit header duplication and parsing, the known pseudo headers
continue to be passed by their index : the name element in the list then
has a NULL pointer and the value is the pseudo header's index. Given that
these ones represent about half of the incoming requests and need to be
found quickly, it maintains an acceptable level of performance.

The code was significantly reduced by doing this because the orignal code
had to deal with HPACK and H1 combinations (eg: index vs not indexed, etc)
and now the HPACK decoding is totally focused on the decompression, and
the H1 encoding doesn't have to deal with the issue of wrapping input for
example.

One bug was addressed here (though it couldn't happen at the moment). The
H2 demuxer used to detect a failure to write the request into the H1 buffer
and would then detect if the output buffer wraps, realign it and try again.
The problem by doing so was that the HPACK context was already modified and
not rewindable. Thus the size check is now performed first and a failure is
reported if it doesn't fit.
2017-11-21 21:13:36 +01:00
Willy Tarreau
f24ea8e45e MEDIUM: h2: add a function to emit an HTTP/1 request from a headers list
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.

Here we introduce a function which performs half of what hpack_decode_header()
used to do, which is to take a list of headers on input and emit the
corresponding request in HTTP/1.1 format. The code is the same and functions
were renamed to be prefixed with "h2" instead of "hpack", though it ends
up being simpler as the various HPACK-specific cases could be fused into
a single one (ie: add header).

Moving this part here makes a lot of sense as now this code is specific to
what is documented in HTTP/2 RFC 7540 and will be able to deal with special
cases related to H2 to H1 conversion enumerated in section 8.1.

Various error codes which were previously assigned to HPACK were never
used (aside being negative) and were all replaced by -1 with a comment
indicating what error was detected. The code could be further factored
thanks to this but this commit focuses on compatibility first.

This code is not yet used but builds fine.
2017-11-21 21:13:33 +01:00
Willy Tarreau
dbd25fc75a BUILD: compiler: add a new type modifier __maybe_unused
While gcc only emits warnings about unused static functions, Clang also
emits such a warning when the functions are inlined. This is a bit
annoying at certain places where functions are provided to manipulate
multiple data types and are not yet used. Let's have a type modifier
"__maybe_unused" which sets the "unused" attribute like the Linux kernel
does. It's elegant as it allows the code author to indicate that it knows
that this element might be unused. It works on variables as well, which
is convenient to remove ifdefs around local variables in certain functions,
but doesn't work on labels.
2017-11-20 21:27:27 +01:00
Willy Tarreau
2532bd2f81 BUILD: threads/plock: fix a build issue on Clang without optimization
[ plock commit 4c53fd3a0b2b1892817cebd0db012a52f4087850 ]

Pieter Baauw reported a build issue affecting haproxy after plock was
included. It happens that expressions of the form :

     if ((const) ? (expr1) : (expr2))
       do_something()

always produce code for both expr1 and expr2 on Clang when building
without optimization. The resulting asm code is even funny, basically
doing :

     mov reg, 1
     cmp reg, 1
     ...

This causes our sizeof() tests to fail to build because we purposely
dereference a fake function that reports the location and nature of the
inconsistency, but this fake function appears in the object code despite
all conditions being there to avoid it.

However the compiler is still smart enough to optimize away code doing

    if (const)
       do_something()

So we simply repeat the condition before do_something(), and the dummy
function is not referenced anymore unless really required.
2017-11-20 21:06:35 +01:00
Willy Tarreau
b5f271555e MINOR: threads/build: atomic: replace the few inlines with macros
[ plock commit 61e255286ae32e83e1a3174dd7c49eda99880a8b]

There are a few inlines such as pl_barrier() and pl_cpu_relax() which
are used a lot. Unfortunately, while building test code at -O0, inlining
is disabled and these ones are called a lot and show up a lot in any
profile, are traced into when single-stepping with a debugger, etc, thus
they are polluting the landscape. Since they're single-asm statements,
there is no reason for not turning them into macros.

The result becomes fairly visible here at -O0 :

  $ size latency.inline latency.macro
     text    data     bss     dec     hex filename
    11431     692     656   12779    31eb treelock.inline
    10967     692     656   12315    301b treelock.macro

And it was verified that regularly optimized code remains strictly identical.
2017-11-20 21:06:35 +01:00
Willy Tarreau
d0d8ba59d3 MINOR: threads/atomic: implement pl_bts() on non-x86
[ plock commit da17ba320aad3a8faf08e36fca604de9cad21fdd ]

This one was missing, it can be done using sync_fetch_and_or().
2017-11-20 21:06:03 +01:00
Willy Tarreau
01b8398b9e MINOR: threads/atomic: implement pl_mb() in asm on x86
[ plock commit 44081ea493dd78dab48076980e881748e9b33db5 ]

Older compilers (eg: gcc 3.4) don't provide __sync_synchronize() so let's
do it by hand on this platform.
2017-11-20 20:45:47 +01:00
Willy Tarreau
f7ba77eb80 MINOR: threads/plock: rename local variables in macros to avoid conflicts
[ plock commit b155d5c762fb9a9793911881f80e61faa6b0e889 ]

Local variables "l", "i" and "ret" were renamed "__pl_l", "__pl_i" and
"__pl_r" respectively, to limit the risk of conflicts with existing
variables in application code.
2017-11-20 20:45:43 +01:00
Willy Tarreau
98409e34ca MINOR: threads/atomic: rename local variables in macros to avoid conflicts
[ plock commit bfac5887ebabb8ef753b0351f162265767eb219b ]

Local variable "t" was renamed "__pl_t" to limit the risk of conflicts
with existing variables in application code.
2017-11-20 20:45:38 +01:00
William Lallemand
71bd11a1f3 MEDIUM: cache: enable the HTTP analysers
Enable the same analysers as the stats applet.
Allows keepalive and termination flags to work.
2017-11-20 19:22:27 +01:00
William Lallemand
44e259c0b7 CLEANUP: cache: remove unused struct
Remove unused structure which remain from old dev.
2017-11-20 19:22:27 +01:00
Tim Duesterhus
d6942c8297 MEDIUM: mworker: Add systemd Type=notify support
This patch adds support for `Type=notify` to the systemd unit.

Supporting `Type=notify` improves both starting as well as reloading
of the unit, because systemd will be let known when the action completed.

See this quote from `systemd.service(5)`:
> Note however that reloading a daemon by sending a signal (as with the
> example line above) is usually not a good choice, because this is an
> asynchronous operation and hence not suitable to order reloads of
> multiple services against each other. It is strongly recommended to
> set ExecReload= to a command that not only triggers a configuration
> reload of the daemon, but also synchronously waits for it to complete.

By making systemd aware of a reload in progress it is able to wait until
the reload actually succeeded.

This patch introduces both a new `USE_SYSTEMD` build option which controls
including the sd-daemon library as well as a `-Ws` runtime option which
runs haproxy in master-worker mode with systemd support.

When haproxy is running in master-worker mode with systemd support it will
send status messages to systemd using `sd_notify(3)` in the following cases:

- The master process forked off the worker processes (READY=1)
- The master process entered the `mworker_reload()` function (RELOADING=1)
- The master process received the SIGUSR1 or SIGTERM signal (STOPPING=1)

Change the unit file to specify `Type=notify` and replace master-worker
mode (`-W`) with master-worker mode with systemd support (`-Ws`).

Future evolutions of this feature could include making use of the `STATUS`
feature of `sd_notify()` to send information about the number of active
connections to systemd. This would require bidirectional communication
between the master and the workers and thus is left for future work.
2017-11-20 18:39:41 +01:00
Olivier Houchard
e6060c5d87 MINOR: SSL: Store the ASN1 representation of client sessions.
Instead of storing the SSL_SESSION pointer directly in the struct server,
store the ASN1 representation, otherwise, session resumption is broken with
TLS 1.3, when multiple outgoing connections want to use the same session.
2017-11-16 19:03:32 +01:00
Christopher Faulet
595d7b72a6 MINOR: applets: Use a bitfield to track applets activity per-thread
a bitfield has been added to know if there are runnable applets for a
thread. When an applet is woken up, the bits corresponding to its thread_mask
are set. When all active applets for a thread is get to be processed, the thread
is removed from active ones by unsetting its tid_bit from the bitfield.
2017-11-16 11:19:46 +01:00
Christopher Faulet
3911ee85df MINOR: tasks: Use a bitfield to track tasks activity per-thread
a bitfield has been added to know if there are runnable tasks for a thread. When
a task is woken up, the bits corresponding to its thread_mask are set. When all
tasks for a thread have been evaluated without any wakeup, the thread is removed
from active ones by unsetting its tid_bit from the bitfield.
2017-11-16 11:19:46 +01:00
William Lallemand
75ea0a06b0 BUG/MEDIUM: mworker: does not close inherited FD
At the end of the master initialisation, a call to protocol_unbind_all()
was made, in order to close all the FDs.

Unfortunately, this function closes the inherited FDs (fd@), upon reload
the master wasn't able to reload a configuration with those FDs.

The create_listeners() function now store a flag to specify if the fd
was inherited or not.

Replace the protocol_unbind_all() by  mworker_cleanlisteners() +
deinit_pollers()
2017-11-15 19:53:33 +01:00
Willy Tarreau
9c1e15d8cd MINOR: tools: emphasize the node being worked on in the tree dump
Now we can show in dotted red the node being removed or surrounded in red
a node having been inserted, and add a description on the graph related to
the operation in progress for example.
2017-11-15 19:43:05 +01:00
Willy Tarreau
ed3cda02ae MINOR: tools: add a function to dump a scope-aware tree to a file
It emits a dump in DOT format for graphing purposes during debugging
sessions. It's convenient to dump the run queue.
2017-11-15 16:07:15 +01:00
Christopher Faulet
99bca65f53 BUG/MEDIUM: standard: itao_str/idx and quote_str/idx must be thread-local
This bug has an impact on the stats applet and easily leads to a crash of HAProxy.

This is specific to threads, no backport is needed.
2017-11-14 18:11:57 +01:00
Christopher Faulet
e9a896e09e BUG/MINOR: threads: tid_bit must be a unsigned long
This is specific to threads, no backport is needed.
2017-11-14 18:11:28 +01:00
Christopher Faulet
fa5c812a6b BUG/MINOR: buffers: Fix b_alloc_margin to be "fonctionnaly" thread-safe
b_alloc_margin is, strickly speeking, thread-safe. It will not crash
HAproxy. But its contract is not respected anymore in a multithreaded
environment. In this function, we need to be sure to have <margin> buffers
available in the pool after the allocation. So to have this guarantee, we must
lock the memory pool during all the operation. This also means, we must call
internal and lockless memory functions (prefixed with '__').

For the record, this patch fixes a pernicious bug happens after a soft reload
where some streams can be blocked infinitly, waiting for a buffer in the
buffer_wq list. This happens because, during a soft reload, pool_gc2 is called,
making some calls to b_alloc_fast fail.

This is specific to threads, no backport is needed.
2017-11-13 11:42:48 +01:00
Christopher Faulet
9dcf9b6f03 MINOR: threads: Use __decl_hathreads to declare locks
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
2017-11-13 11:38:17 +01:00
Willy Tarreau
387bd4f69f CLEANUP: global: introduce variable pid_bit to avoid shifts with relative_pid
At a number of places, bitmasks are used for process affinity and to map
listeners to processes. Every time 1UL<<(relative_pid-1) is used. Let's
create a "pid_bit" variable corresponding to this value to clean this up.
2017-11-10 19:08:14 +01:00
Willy Tarreau
28b55c6fed CLEANUP: mux: remove the unused "release()" function
In commit 53a4766 ("MEDIUM: connection: start to introduce a mux layer
between xprt and data") we introduced a release() function which ends
up never being used. Let's get rid of it now.
2017-11-10 16:43:05 +01:00
Willy Tarreau
aa39860aef MINOR: tools: don't use unlikely() in hex2i()
This small inline function causes some pain to the compiler when used
inside other functions due to its use of the unlikely() hint for non-digits.
It causes the letters to be processed far away in the calling function and
makes the code less efficient. Removing these unlikely() hints has increased
the chunk size parsing by around 5%.
2017-11-10 11:19:54 +01:00
Willy Tarreau
b15e3fefc9 BUG/MEDIUM: h1: ensure the chunk size parser can deal with full buffers
The HTTP/1 code always has the reserve left available so the buffer is
never full there. But with HTTP/2 we have to deal with full buffers,
and it happens that the chunk size parser cannot tell the difference
between a full buffer and an empty one since it compares the start and
the stop pointer.

Let's change this to instead deal with the number of bytes left to process.

As a side effect, this code ends up being about 10% faster than the previous
one, even on HTTP/1.
2017-11-10 11:17:08 +01:00
Christopher Faulet
c5a9d5bf23 BUG/MEDIUM: stream-int: Don't loss write's notifs when a stream is woken up
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.

Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.

To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.

This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
2017-11-09 15:16:05 +01:00
Willy Tarreau
1b4cf9b754 BUG/MINOR: h1: the HTTP/1 make status code parser check for digits
The H1 parser used by the H2 gateway was a bit lax and could validate
non-numbers in the status code. Since it computes the code on the fly
it's problematic, as "30:" is read as status code 310. Let's properly
check that it's a number now. No backport needed.
2017-11-09 11:15:45 +01:00
Olivier Houchard
522eea7110 MINOR: ssl: Handle sending early data to server.
This adds a new keyword on the "server" line, "allow-0rtt", if set, we'll try
to send early data to the server, as long as the client sent early data, as
in case the server rejects the early data, we no longer have them, and can't
resend them, so the only option we have is to send back a 425, and we need
to be sure the client knows how to interpret it correctly.
2017-11-08 14:11:10 +01:00
Emeric Brun
d8b3b65faa BUG/MEDIUM: splice/threads: pipe reuse list was not protected.
The list is now protected using a global spinlock.
2017-11-07 14:47:28 +01:00
Christopher Faulet
2a944ee16b BUILD: threads: Rename SPIN/RWLOCK macros using HA_ prefix
This remove any name conflicts, especially on Solaris.
2017-11-07 11:10:24 +01:00
Olivier Houchard
55dcdf4c39 BUG/MINOR: dns: Don't try to get the server lock if it's already held.
dns_link_resolution() can be called with the server lock already held, so
don't attempt to lock it again in that case.
2017-11-06 18:34:24 +01:00
Willy Tarreau
88ac59be4d MINOR: threads: use faster locks for the spin locks
The spin locks used to rely on W locks, which involve a loop waiting
for readers to leave, and this doesn't happen here. It's more efficient
to use S locks instead, which are also mutually exclusive and do not
have this loop. This saves one test per spinlock and a few tens of
bytes allowing certain functions to be inlined.
2017-11-06 11:20:11 +01:00
Willy Tarreau
8d38805d3d MAJOR: task: make use of the scope-aware ebtree functions
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.

This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.

This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.

Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.

Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :

                              Before     After    Gain
              test duration : 89.6s      4.73s     x19
    HTTP requests/s (DEBUG) : 11200     211300     x19
     HTTP requests/s (PROD) : 15900     447000     x28
             spin_lock time : 85.2s      0.46s    /185
            time per lookup : 13us       40ns     /325

Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.

A test showed that there's no benefit in using this for the wait queue
though.
2017-11-06 11:20:11 +01:00