fd_insert() is currently called just after setting the owner and iocb,
but proceeding like this prevents the operation from being atomic and
requires a lock to protect the maxfd computation in another thread from
meeting an incompletely initialized FD and computing a wrong maxfd.
Fortunately for now all fdtab[].owner are set before calling fd_insert(),
and the first lock in fd_insert() enforces a memory barrier so the code
is safe.
This patch moves the initialization of the owner and iocb to fd_insert()
so that the function will be able to properly arrange its operations and
remain safe even when modified to become lockless. There's no other change
beyond the internal API.
These functions were created for poll() in 1.5-dev18 (commit 80da05a4) to
replace the previous FD_{CLR,SET,ISSET} that were shared with select()
because some libcs enforce a limit on FD_SET. But FD_SET doesn't seem
to be universally MT-safe, requiring locks in the select() code that
are not needed in the poll code. So let's move back to the initial
situation where we used to only use bit fields, since that has been in
use since day one without a problem, and let's use these hap_fd_*
functions instead of FD_*.
This patch only moves the functions to fd.h and revives hap_fd_isset()
that was recently removed to kill an "unused" warning.
Since only select() and poll() still make use of maxfd, let's move
its computation right there in the pollers themselves, and only
during each fd update pass. The computation doesn't need a lock
anymore, only a few atomic ops. It will be accurate, be done much
less often and will not be required anymore in the FD's fast patch.
This provides a small performance increase of about 1% in connection
rate when using epoll since we get rid of this computation which was
performed under a lock.
Some pollers like epoll() need to know if the fd is already known or
not in order to compute the operation to perform (add, mod, del). For
now this is performed based on the difference between the previous FD
state and the new state but this will not be usable anymore once threads
become responsible for their own polling.
Here we come with a different approach : a bitmask is stored with the
fd to indicate which pollers already know it, and the pollers will be
able to simply perform the add/mod/del operations based on this bit
combined with the new state.
This patch only adds the bitmask declaration and initialization, it
is it not yet used. It will be needed by the next two fixes and will
need to be backported to 1.8.
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
This fix should be backported to 1.8.
A bitfield has been added to know if there are some FDs processable by a
specific thread in the FD cache. When a FD is inserted in the FD cache, the bits
corresponding to its thread_mask are set. On each thread, the bitfield is
updated when the FD cache is processed. If there is no FD processed, the thread
is removed from the bitfield by unsetting its tid_bit.
Note that this bitfield is updated but not checked in
fd_process_cached_events. So, when this function is called, the FDs cache is
always processed.
[wt: should be backported to 1.8 as it will help fix a design limitation]
Since commit f9ce57e ("MEDIUM: connection: make conn_sock_shutw() aware
of lingering"), we refrain from performing the shutw() on the socket if
there is no lingering risk. But there is a problem with this in tunnel
and in TCP modes where a client is explicitly allowed to send a shutw
to the server, eventhough it it risky.
Not doing it creates this situation reported by Ricardo Fraile and
diagnosed by Christopher : a typical HTTP client (eg: curl) connecting
via the config below to an HTTP server would receive its response,
immediately close while the server remains in keep-alive mode. The
shutr() received by haproxy from the client is "propagated" to the
server side but not acted upon because fdtab[fd].linger_risk is set,
so we expect that the next close will immediately complete this
operation.
listen proxy-tcp
bind 127.0.0.1:8888
mode tcp
timeout connect 5s
timeout server 10s
timeout client 10s
server server1 127.0.0.1:8000
But since the whole stream will not end until the server closes in
turn, the server doesn't close and haproxy expires on server timeout.
This problem has already struck by waking up an older bug and was
partially fixed with commit 8059351 ("BUG/MEDIUM: http: don't disable
lingering on requests with tunnelled responses") though it was not
enough.
The problem is that linger_risk is not suited here. In fact we need to
know whether or not it is desired to close normally or silently, and
whether or not a shutr() has already been received on this connection.
This is the approach this patch takes, and it solves the problem for
the various difficult modes (tcp, http-server-close, pretend-keepalive).
This fix needs to be backported to 1.8. Many thanks to Ricardo for
providing very detailed traces and configurations.
The new function check_request_for_cacheability() is used to check if
a request may be served from the cache, and/or allows the response to
be stored into the cache. For this it checks the cache-control and
pragma header fields, and adjusts the existing TX_CACHEABLE and a new
TX_CACHE_IGNORE flags.
For now, just like its response side counterpart, it only checks the
first value of the header field. These functions should be reworked to
improve their parsers and validate all elements.
The thread patches adds refcount for notifications. The notifications are
used with the Lua cosocket. These refcount free the notifications when
the session is cleared. In the Lua task case, it not have sessions, so
the nofications are never cleraed.
This patch adds a garbage collector for signals. The garbage collector
just clean the notifications for which the end point is disconnected.
This patch should be backported in 1.8
This BUG was introduced with:
'MEDIUM: threads/stick-tables: handle multithreads on stick tables'
The API was reviewed to handle stick table entry updates
asynchronously and the caller must now call a 'stkable_touch_*'
function each time the content of an entry is modified to
register the entry to be synced.
There was missing call to stktable_touch_* resulting in
not propagated entries to remote peers (or local one during reload)
pendconn_get_next_strm() is called from process_srv_queue() under the
server lock, and calls stream_add_srv_conn() with this lock held, while
the latter tries to take it again. This results in a deadlock when
a server's maxconn is reached and haproxy is built with thread support.
Commit 9dcf9b6 ("MINOR: threads: Use __decl_hathreads to declare locks")
accidently lost a few "extern" in certain lock declarations, possibly
causing certain entries to be declared at multiple places. Apparently
it hasn't caused any harm though.
The offending ones were :
- fdtab_lock
- fdcache_lock
- poll_lock
- buffer_wq_lock
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
Rename the global variable "proxy" to "proxies_list".
There's been multiple proxies in haproxy for quite some time, and "proxy"
is a potential source of bugs, a number of functions have a "proxy" argument,
and some code used "proxy" when it really meant "px" or "curproxy". It worked
by pure luck, because it usually happened while parsing the config, and thus
"proxy" pointed to the currently parsed proxy, but we should probably not
rely on this.
[wt: some of these are definitely fixes that are worth backporting]
It can happen that we want to read early data, write some, and then continue
reading them.
To do so, we can't reuse tmp_early_data to store the amount of data sent,
so introduce a new member.
If we read early data, then ssl_sock_to_buf() is now the only responsible
for getting back to the handshake, to make sure we don't miss any early data.
a bitfield has been added to know if there are runnable applets for a
thread. When an applet is woken up, the bits corresponding to its thread_mask
are set. When all active applets for a thread is get to be processed, the thread
is removed from active ones by unsetting its tid_bit from the bitfield.
a bitfield has been added to know if there are runnable tasks for a thread. When
a task is woken up, the bits corresponding to its thread_mask are set. When all
tasks for a thread have been evaluated without any wakeup, the thread is removed
from active ones by unsetting its tid_bit from the bitfield.
At the end of the master initialisation, a call to protocol_unbind_all()
was made, in order to close all the FDs.
Unfortunately, this function closes the inherited FDs (fd@), upon reload
the master wasn't able to reload a configuration with those FDs.
The create_listeners() function now store a flag to specify if the fd
was inherited or not.
Replace the protocol_unbind_all() by mworker_cleanlisteners() +
deinit_pollers()
This macro should be used to declare variables or struct members depending on
the USE_THREAD compile option. It avoids the encapsulation of such declarations
between #ifdef/#endif. It is used to declare all lock variables.
The HTTP/1 code always has the reserve left available so the buffer is
never full there. But with HTTP/2 we have to deal with full buffers,
and it happens that the chunk size parser cannot tell the difference
between a full buffer and an empty one since it compares the start and
the stop pointer.
Let's change this to instead deal with the number of bytes left to process.
As a side effect, this code ends up being about 10% faster than the previous
one, even on HTTP/1.
When a write activity is reported on a channel, it is important to keep this
information for the stream because it take part on the analyzers' triggering.
When some data are written, the flag CF_WRITE_PARTIAL is set. It participates to
the task's timeout updates and to the stream's waking. It is also used in
CF_MASK_ANALYSER mask to trigger channels anaylzers. In the past, it was cleared
by process_stream. Because of a bug (fixed in commit 95fad5ba4 ["BUG/MAJOR:
stream-int: don't re-arm recv if send fails"]), It is now cleared before each
send and in stream_int_notify. So it is possible to loss this information when
process_stream is called, preventing analyzers to be called, and possibly
leading to a stalled stream.
Today, this happens in HTTP2 when you call the stat page or when you use the
cache filter. In fact, this happens when the response is sent by an applet. In
HTTP1, everything seems to work as expected.
To fix the problem, we need to make the difference between the write activity
reported to lower layers and the one reported to the stream. So the flag
CF_WRITE_EVENT has been added to notify the stream of the write activity on a
channel. It is set when a send succedded and reset by process_stream. It is also
used in CF_MASK_ANALYSER. finally, it is checked in stream_int_notify to wake up
a stream and in channel_check_timeouts.
This bug is probably present in 1.7 but it seems to have no effect. So for now,
no needs to backport it.
The H1 parser used by the H2 gateway was a bit lax and could validate
non-numbers in the status code. Since it computes the code on the fly
it's problematic, as "30:" is read as status code 310. Let's properly
check that it's a number now. No backport needed.
Currently the task scheduler suffers from an O(n) lookup when
skipping tasks that are not for the current thread. The reason
is that eb32_lookup_ge() has no information about the current
thread so it always revisits many tasks for other threads before
finding its own tasks.
This is particularly visible with HTTP/2 since the number of
concurrent streams created at once causes long series of tasks
for the same stream in the scheduler. With only 10 connections
and 100 streams each, by running on two threads, the performance
drops from 640kreq/s to 11.2kreq/s! Lookup metrics show that for
only 200000 task lookups, 430 million skips had to be performed,
which means that on average, each lookup leads to 2150 nodes to
be visited.
This commit backports the principle of scope lookups for ebtrees
from the ebtree_v7 development tree. The idea is that each node
contains a mask indicating the union of the scopes for the nodes
below it, which is fed during insertion, and used during lookups.
Then during lookups, branches that do not contain any leaf matching
the requested scope are simply ignored. This perfectly matches a
thread mask, allowing a thread to only extract the tasks it cares
about from the run queue, and to always find them in O(log(n))
instead of O(n). Thus the scheduler uses tid_bit and
task->thread_mask as the ebtree scope here.
Doing this has recovered most of the performance, as can be seen on
the test below with two threads, 10 connections, 100 streams each,
and 1 million requests total :
Before After Gain
test duration : 89.6s 4.73s x19
HTTP requests/s (DEBUG) : 11200 211300 x19
HTTP requests/s (PROD) : 15900 447000 x28
spin_lock time : 85.2s 0.46s /185
time per lookup : 13us 40ns /325
Even when going to 6 threads (on 3 hyperthreaded CPU cores), the
performance stays around 284000 req/s, showing that the contention
is much lower.
A test showed that there's no benefit in using this for the wait queue
though.
The __appctx_wakeup() function already does it. It matters with threads
enabled because it simplifies the code in appctx_res_wakeup() to get rid
of this test.
unbind_listener() takes the listener lock, which is already held by
enable_listener(). This situation happens when starting with nbproc > 1
with some bind lines limited to a certain process, because in this case
enable_listener() tries to stop unneeded listeners.
This commit introduces __do_unbind_listeners() which must be called with
the lock held, and makes enable_listener() use this one. Given that the
only return code has never been used and that it starts to make the code
more complicated to propagate it before throwing it to the trash, the
function's return type was changed to void.
This callback will be used to release upper layers when a mux is in
use. Given that the mux can be asynchronously deleted, we need a way
to release the extra information such as the session.
This callback will be called directly by the mux upon releasing
everything and before the connection itself is released, so that
the callee can find its information inside the connection if needed.
The way it currently works is not perfect, and most likely this should
instead become a mux release callback, but for now we have no easy way
to add mux-specific stuff, and since there's one mux per connection,
it works fine this way.
For H2, only the mux's timeout or other conditions might cause a
release of the mux and the connection, no stream should be allowed
to kill such a shared connection. So a stream will only detach using
cs_destroy() which will call mux->detach() then free the cs.
For now it's only handled by mux_pt. The goal is that the data layer
never has to care about the connection, which will have to be released
depending on the mux's mood.
This basically calls cs_shutw() followed by cs_shutr(). Both of them
are called in the most conservative mode so that any previous call is
still respected. The CS flags are cleared so that it can be reused
(this is important for connection retries when conn and CS are reused
without being reallocated).
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
Most of the functions dealing with conn_streams are here. They act at
the data layer and interact with the mux. For now they are not used yet
but everything builds.
This patch introduces a new struct conn_stream. It's the stream-side of
a multiplexed connection. A pool is created and destroyed on exit. For
now the conn_streams are not used at all.
When an incoming connection is made on an HTTP mode frontend, the
session now looks up the mux to use based on the ALPN token and the
proxy mode. This will allow easier mux registration, and we don't
need to hard-code the mux_pt_ops anymore.
Selecting a mux based on ALPN and the proxy mode will quickly become a
pain. This commit provides new functions to register/lookup a mux based
on the ALPN string and the proxy mode to make this easier. Given that
we're not supposed to support a wide range of muxes, the lookup should
not have any measurable performance impact.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
This is needed in the H2->H1 gateway so that we know how long the trailers
block is in chunked encoding. It returns the number of bytes, or 0 if some
are missing, or -1 in case of parse error.
It was a leftover from the last cleaning session; this mask applies
to threads and calling it process_mask is a bit confusing. It's the
same in fd, task and applets.
srv_set_fqdn() may be called with the DNS lock already held, but tries to
lock it anyway. So, add a new parameter to let it know if it was already
locked or not;
Commit 819fc6f ("MEDIUM: threads/stick-tables: handle multithreads on
stick tables") introduced a valid warning about an uninitialized return
value in stksess_kill_if_expired(). It just happens that this result is
never used, so let's turn the function back to void as previously.
The wrong bit was set to keep the lock on freq counter update. And the read
functions were re-worked to use volatile.
Moreover, when a freq counter is updated, it is now rotated only if the current
counter is in the past (now.tv_sec > ctr->curr_sec). It is important with
threads because the current time (now) is thread-local. So, rounded to the
second, the time may vary by more or less 1 second. So a freq counter rotated by
one thread may be see 1 second in the future. In this case, it is updated but
not rotated.