When the block of data need to be split to support the wrapping, the start of
the second block of data was wrong. We must be sure to skip data copied during
the first memcpy.
This patch must be backported to 1.8, 1.7, 1.6 and 1.5.
Since we use padding before the allocated page, it's trivial to place
the allocated address there and see if it gets mangled once we release
it.
This may be backported to stable releases already using DEBUG_UAF.
Commit 158fa75 ("MINOR: pools: implement DEBUG_UAF to detect use after free")
implemented pool use-after-free detection, but the mmap() return value isn't
properly checked, preventing the call to pool_alloc_area() from returning
NULL. So on out-of-memory a mangled pointer is returned, causing a crash on
the pool_alloc() site instead of forcing a GC. It doesn't affect regular
operations however, just complicates complex bug investigations.
This fix should be backported to 1.8 and to 1.7.
Since commit cf975d4 ("MINOR: pools/threads: Implement lockless memory
pools."), we support lockless pools. However the parts dedicated to
detecting use-after-free are not present in this part, making DEBUG_UAF
useless in this situation.
The present patch sets a new define CONFIG_HAP_LOCKLESS_POOLS when such
a compatible architecture is detected, and when pool debugging is not
requested, then makes use of this everywhere in pools and buffers
functions. This way enabling DEBUG_UAF will automatically disable the
lockless version.
No backport is needed as this is purely 1.9-dev.
This removes the end label from memory.h.
The labels are unused as of cf975d46bc
which is unreleased (and incidentally the first commit containing
those labels, thus they never have been used).
A TLS ticket keys file can be updated on the CLI and used in same time. So we
need to protect it to be sure all accesses are thread-safe. Because updates are
infrequent, a R/W lock has been used.
This patch must be backported in 1.8
Commit f61f0cb ("MINOR: threads: Introduce double-width CAS on x86_64
and arm.") introduced the double CAS. But the ARMv7 version is bogus,
it uses the value of the pointers instead of dereferencing them. When
lucky, it simply doesn't build due to impossible registers combinations.
Otherwise it will immediately crash at run time when facing traffic.
No backport is needed, this bug was introduced in 1.9-dev.
Each fd_{may|cant|stop|want}_{recv|send} function sets or resets a
single bit at once, then recomputes the need for updates, and then
the new cache state. Later, pollers will compute the new polling
state based on the resulting operations here. In fact the conditions
are so simple that they can be performed by a single "if", or sometimes
even optimized away.
This means that in practice a simple compare-and-swap operation if often
enough to set the new value inluding the new polling state, and that only
the cache and fdupdt have to be performed under the lock. Better, for the
most common operations (fd_may_{recv,send}, used by the pollers), a simple
atomic OR is needed.
This patch does this for the fd_* functions above and it doesn't yet
remove the now useless fd_compute_new_polling_status() because it's still
used by other pollers. A pure connection rate test shows a 1% performance
increase.
An fd cache entry might be removed and added at the end of the list, while
another thread is parsing it, if that happens, we may miss fd cache entries,
to avoid that, add a new field in the struct fdtab, "added_mask", which
contains a mask for potentially affected threads, if it is set, the
corresponding thread will set its bit in fd_cache_mask, to avoid waiting in
poll while it may have more work to do.
Create a local, per-thread, fdcache, for file descriptors that only belongs
to one thread, and make the global fd cache mostly lockless, as we can get
a lot of contention on the fd cache lock.
Instead of looking for CO_FL_EARLY_DATA to know if we have to try to wake
up a stream, because it is waiting for a SSL handshake, instead add a new
conn_stream flag, CS_FL_WAIT_FOR_HS. This way we don't have to rely on
CO_FL_EARLY_DATA, and we will only wake streams that are actually waiting.
Instead of using a list of applets with idle ones in front, we now use an
ebtree. Aapplets in the tree are idle by definition. And the key is the applet's
weight. When a new frame is queued, the first idle applet (with the lowest
weight) is woken up and its weight is increased by one. And when an applet sends
a frame to a SPOA, its weight is decremented by one.
This is empirical, but it should avoid to overuse a very few number of applets
and increase the balancing between idle applets.
So it is easier to respect the max_fpa value. This is no more the maximum frames
processed by an applet at each loop but the maximum frames waiting for an ack
for a specific applet.
The function spoe_handle_processing_appctx has been rewritten accordingly.
sending_rate was a counter used to evaluate the SPOE capacity to process
frames. Because it was not really accurrate, it has been replaced by a frequency
counter representing the number of frames handled by the SPOE per second. We
just check this counter is higher than the number of streams waiting for a
reply. If not, a new applet is created.
The calculation of a minimal number of active applets was really empirical and
finally useless. On heavy load, there are always many active applets (most of
time, more than the minimal required) and when the load is low, there is no
reason to keep unused applets opened.
Because of this change, the flag SPOE_APPCTX_FL_PERSIST is now unused. So it has
been removed.
Recent changes to the enum were not synchronized with the lock debugging
code. Now we use a switch/case instead of an array so that the compiler
throws a warning if there is any inconsistency.
To be backported to 1.8 (at least to add the START entry).
fd_insert() is currently called just after setting the owner and iocb,
but proceeding like this prevents the operation from being atomic and
requires a lock to protect the maxfd computation in another thread from
meeting an incompletely initialized FD and computing a wrong maxfd.
Fortunately for now all fdtab[].owner are set before calling fd_insert(),
and the first lock in fd_insert() enforces a memory barrier so the code
is safe.
This patch moves the initialization of the owner and iocb to fd_insert()
so that the function will be able to properly arrange its operations and
remain safe even when modified to become lockless. There's no other change
beyond the internal API.
These functions were created for poll() in 1.5-dev18 (commit 80da05a4) to
replace the previous FD_{CLR,SET,ISSET} that were shared with select()
because some libcs enforce a limit on FD_SET. But FD_SET doesn't seem
to be universally MT-safe, requiring locks in the select() code that
are not needed in the poll code. So let's move back to the initial
situation where we used to only use bit fields, since that has been in
use since day one without a problem, and let's use these hap_fd_*
functions instead of FD_*.
This patch only moves the functions to fd.h and revives hap_fd_isset()
that was recently removed to kill an "unused" warning.
Since only select() and poll() still make use of maxfd, let's move
its computation right there in the pollers themselves, and only
during each fd update pass. The computation doesn't need a lock
anymore, only a few atomic ops. It will be accurate, be done much
less often and will not be required anymore in the FD's fast patch.
This provides a small performance increase of about 1% in connection
rate when using epoll since we get rid of this computation which was
performed under a lock.
The incorrect comment was introduced in commit:
2ac5718dbd
v1.5-dev9 is the first tag containing this comment, the fix
should be backported to haproxy 1.5 and newer.
Marc Fournier reported an interesting case when using threads with the
master-worker mode : sometimes, a listener would have its FD closed
during startup. Sometimes it could even be health checks seeing this.
What happens is that after the threads are created, and the pollers
enabled on each threads, the master-worker pipe is registered, and at
the same time a close() is performed on the write side of this pipe
since the children must not use it.
But since this is replicated in every thread, what happens is that the
first thread closes the pipe, thus releases the FD, and the next thread
starting a listener in parallel gets this FD reassigned. Then another
thread closes the FD again, which this time corresponds to the listener.
It can also happen with the health check sockets if they're started
early enough.
This patch splits the mworker_pipe_register() function in two, so that
the close() of the write side of the FD is performed very early after the
fork() and long before threads are created (we don't need to delay it
anyway). Only the pipe registration is done in the threaded code since
it is important that the pollers are properly allocated for this.
The mworker_pipe_register() function now takes care of registering the
pipe only once, and this is guaranteed by a new surrounding lock.
The call to protocol_enable_all() looks fragile in theory since it
scans the list of proxies and their listeners, though in practice
all threads scan the same list and take the same locks for each
listener so it's not possible that any of them escapes the process
and finishes before all listeners are started. And the operation is
idempotent.
This fix must be backported to 1.8. Thanks to Marc for providing very
detailed traces clearly showing the problem.
Some pollers like epoll() need to know if the fd is already known or
not in order to compute the operation to perform (add, mod, del). For
now this is performed based on the difference between the previous FD
state and the new state but this will not be usable anymore once threads
become responsible for their own polling.
Here we come with a different approach : a bitmask is stored with the
fd to indicate which pollers already know it, and the pollers will be
able to simply perform the add/mod/del operations based on this bit
combined with the new state.
This patch only adds the bitmask declaration and initialization, it
is it not yet used. It will be needed by the next two fixes and will
need to be backported to 1.8.
Since the fd update tables are per-thread, we need to have a bit per
thread to indicate whether an update exists, otherwise this can lead
to lost update events every time multiple threads want to update the
same FD. In practice *for now*, it only happens at start time when
listeners are enabled and ask for polling after facing their first
EAGAIN. But since the pollers are still shared, a lost event is still
recovered by a neighbor thread. This will not reliably work anymore
with per-thread pollers, where it has been observed a few times on
startup that a single-threaded listener would not always accept
incoming connections upon startup.
It's worth noting that during this code review it appeared that the
"new" flag in the fdtab isn't used anymore.
This fix should be backported to 1.8.
A bitfield has been added to know if there are some FDs processable by a
specific thread in the FD cache. When a FD is inserted in the FD cache, the bits
corresponding to its thread_mask are set. On each thread, the bitfield is
updated when the FD cache is processed. If there is no FD processed, the thread
is removed from the bitfield by unsetting its tid_bit.
Note that this bitfield is updated but not checked in
fd_process_cached_events. So, when this function is called, the FDs cache is
always processed.
[wt: should be backported to 1.8 as it will help fix a design limitation]
A number of counters have been added at special places helping better
understanding certain bug reports. These counters are maintained per
thread and are shown using "show activity" on the CLI. The "clear
counters" commands also reset these counters. The output is sent as a
single write(), which currently produces up to about 7 kB of data for
64 threads. If more counters are added, it may be necessary to write
into multiple buffers, or to reset the counters.
To backport to 1.8 to help collect more detailed bug reports.
This one allows not to inflate some structures when threads are
disabled. Now struct global is 1.4 kB instead of 33 kB.
Should be backported to 1.8 for ease of backporting of upcoming
patches.
The "thread" part is 32kB long, better move it at the end of the
structure since it's only used during initialization, to keep the
rest grouped together.
Should be backported to 1.8 to ease backporting of upcoming patches,
no functional impact.
In addition to "option force-set-var", recently added, this directive can be
used to selectivelly register unknown variable names, without totally relaxing
their registration during the runtime, like "option force-set-var" does.
So there is no way for a malicious agent to exhaust memory by defining a too
high number of variable names. In other hand, you need to enumerate all
variable names. This could be painfull in some circumstances.
Remember, this directive is only usefull when the variable names are not
referenced anywhere in the HAProxy configuration or the SPOE one.
Thanks to Etienne Carrire for his help on this part.
Till now the use of __atomic_* gcc builtins required gcc >= 4.7. Since
some supported and quite common operating systems like CentOS 6 still
come with older versions (4.4) and the mapping to the older builtins
is reasonably simple, let's implement it.
This code is only used for gcc < 4.7. It has been quickly tested on a
machine using gcc 4.4.4 and provided expected results.
This patch should be backported to 1.8.
A SRV record weight can range from 0 to 65535, while haproxy weight goes
from 0 to 256, so we have to divide it by 256 before handing it to haproxy.
Also, a SRV record with a weight of 0 doesn't mean the server shouldn't be
used, so use a minimum weight of 1.
This should probably be backported to 1.8.
Since commit f9ce57e ("MEDIUM: connection: make conn_sock_shutw() aware
of lingering"), we refrain from performing the shutw() on the socket if
there is no lingering risk. But there is a problem with this in tunnel
and in TCP modes where a client is explicitly allowed to send a shutw
to the server, eventhough it it risky.
Not doing it creates this situation reported by Ricardo Fraile and
diagnosed by Christopher : a typical HTTP client (eg: curl) connecting
via the config below to an HTTP server would receive its response,
immediately close while the server remains in keep-alive mode. The
shutr() received by haproxy from the client is "propagated" to the
server side but not acted upon because fdtab[fd].linger_risk is set,
so we expect that the next close will immediately complete this
operation.
listen proxy-tcp
bind 127.0.0.1:8888
mode tcp
timeout connect 5s
timeout server 10s
timeout client 10s
server server1 127.0.0.1:8000
But since the whole stream will not end until the server closes in
turn, the server doesn't close and haproxy expires on server timeout.
This problem has already struck by waking up an older bug and was
partially fixed with commit 8059351 ("BUG/MEDIUM: http: don't disable
lingering on requests with tunnelled responses") though it was not
enough.
The problem is that linger_risk is not suited here. In fact we need to
know whether or not it is desired to close normally or silently, and
whether or not a shutr() has already been received on this connection.
This is the approach this patch takes, and it solves the problem for
the various difficult modes (tcp, http-server-close, pretend-keepalive).
This fix needs to be backported to 1.8. Many thanks to Ricardo for
providing very detailed traces and configurations.
The new function check_request_for_cacheability() is used to check if
a request may be served from the cache, and/or allows the response to
be stored into the cache. For this it checks the cache-control and
pragma header fields, and adjusts the existing TX_CACHEABLE and a new
TX_CACHE_IGNORE flags.
For now, just like its response side counterpart, it only checks the
first value of the header field. These functions should be reworked to
improve their parsers and validate all elements.
By copying the info in the stream interface that the mux cleanly reports
aborts, we'll have the ability to check this flag wherever needed regardless
of the presence of a mux or not.
This new field will be used to describe certain properties of some
muxes. For now we only add MX_FL_CLEAN_ABRT to indicate that a mux
is able to unambiguously report aborts using CS_FL_ERROR contrary
to others who may only report it via a read0. This will be used to
improve handling of the abortonclose option with H2. Other flags
may come later to report multiplexing capabilities or not, support
of client/server sides etc.
For security reasons, the spoe filter was only able to change values of
existing variables. In specific cases (ex : with LUA code), the name of
variables are unknown at the configuration parsing phase.
The force-set-var option can be enabled to register all variables.
Due to the nature of multiplexed protocols, it will often happen that
some operations are only performed on full frames, preventing any partial
operation from being performed. HTTP/2 is one such example. The current
MUX API causes a problem here because the rcv_buf() function has no way
to let the stream layer know that some data could not be read due to a
lack of room in the buffer, but that data are definitely present. The
problem with this is that the stream layer might not know it needs to
call the function again after it has made some room. And if the frame
in the buffer is not followed by any other, nothing will move anymore.
This patch introduces a new conn_stream flag CS_FL_RCV_MORE whose purpose
is to indicate on the stream that more data than what was received are
already available for reading as soon as more room will be available in
the buffer.
This patch doesn't make use of this flag yet, it only declares it. It is
expected that other similar flags may come in the future, such as reports
of pending end of stream, errors or any such event that might save the
caller from having to poll, or simply let it know that it can take some
actions after having processed data.
The thread patches adds refcount for notifications. The notifications are
used with the Lua cosocket. These refcount free the notifications when
the session is cleared. In the Lua task case, it not have sessions, so
the nofications are never cleraed.
This patch adds a garbage collector for signals. The garbage collector
just clean the notifications for which the end point is disconnected.
This patch should be backported in 1.8
The number of async fd is computed considering the maxconn, the number
of sides using ssl and the number of engines using async mode.
This patch should be backported on haproxy 1.8
In hpack_dht_make_room(), we try to fulfill this rule form RFC7541#4.4 :
"It is not an error to attempt to add an entry that is larger than the
maximum size; an attempt to add an entry larger than the maximum size
causes the table to be emptied of all existing entries and results in
an empty table."
Unfortunately it is not consistent with the way it's used in
hpack_dht_insert() as this last one will consider a success as a
confirmation it can copy the header into the table, and a failure as
an indexing error. This results in the two following issues :
- if a client sends too large a header into an empty table, this
header may overflow the table. Fortunately, most clients send
small headers like :authority first, and never mark headers that
don't fit into the table as indexable since it is counter-productive ;
- if a client sends too large a header into a populated table, the
operation fails after the table is totally flushed and the request
is not processed.
This patch fixes the two issues at once :
- a header not fitting into an empty table is always a sign that it
will never fit ;
- not fitting into the table is not an error
Thanks to Yves Lafon for reporting detailed traces demonstrating this
issue. This fix must be backported to 1.8.
If the hpack decoder sees an invalid header index, it emits value
"### ERR ###" that was used during debugging instead of rejecting the
block. This is harmless, and was detected by h2spec.
To backport to 1.8.
This BUG was introduced with:
'MEDIUM: threads/stick-tables: handle multithreads on stick tables'
The API was reviewed to handle stick table entry updates
asynchronously and the caller must now call a 'stkable_touch_*'
function each time the content of an entry is modified to
register the entry to be synced.
There was missing call to stktable_touch_* resulting in
not propagated entries to remote peers (or local one during reload)
server.h needs checks.h since it references the struct check, but depending
on the include order it will fail if check.h is included first due to this
one including server.h in turn while it doesn't need it.
Released version 1.9-dev0 with the following main changes :
- BUG/MEDIUM: stream: don't automatically forward connect nor close
- BUG/MAJOR: stream: ensure analysers are always called upon close
- BUG/MINOR: stream-int: don't try to read again when CF_READ_DONTWAIT is set
- MEDIUM: mworker: Add systemd `Type=notify` support
- BUG/MEDIUM: cache: free callback to remove from tree
- CLEANUP: cache: remove unused struct
- MEDIUM: cache: enable the HTTP analysers
- CLEANUP: cache: remove wrong comment
- MINOR: threads/atomic: rename local variables in macros to avoid conflicts
- MINOR: threads/plock: rename local variables in macros to avoid conflicts
- MINOR: threads/atomic: implement pl_mb() in asm on x86
- MINOR: threads/atomic: implement pl_bts() on non-x86
- MINOR: threads/build: atomic: replace the few inlines with macros
- BUILD: threads/plock: fix a build issue on Clang without optimization
- BUILD: ebtree: don't redefine types u32/s32 in scope-aware trees
- BUILD: compiler: add a new type modifier __maybe_unused
- BUILD: h2: mark some inlined functions "unused"
- BUILD: server: check->desc always exists
- BUG/MEDIUM: h2: properly report connection errors in headers and data handlers
- MEDIUM: h2: add a function to emit an HTTP/1 request from a headers list
- MEDIUM: h2: change hpack_decode_headers() to only provide a list of headers
- BUG/MEDIUM: h2: always reassemble the Cookie request header field
- BUG/MINOR: systemd: ignore daemon mode
- CONTRIB: spoa_example: allow to compile outside HAProxy.
- CONTRIB: spoa_example: remove bref, wordlist, cond_wordlist
- CONTRIB: spoa_example: remove last dependencies on type "sample"
- CONTRIB: spoa_example: remove SPOE enums that are useless for clients
- CLEANUP: cache: reorder includes
- MEDIUM: shctx: use unsigned int for len and block_count
- MEDIUM: cache: "show cache" on the cli
- BUG/MEDIUM: cache: use key=0 as a condition for freeing
- BUG/MEDIUM: cache: refcount forbids to free the objects
- BUG/MEDIUM: cache fix cli_kws structure
- BUG/MEDIUM: deinit: correctly deinitialize the proxy and global listener tasks
- BUG/MINOR: ssl: Always start the handshake if we can't send early data.
- MINOR: ssl: Don't disable early data handling if we could not write.
- MINOR: pools: prepare functions to override malloc/free in pools
- MINOR: pools: implement DEBUG_UAF to detect use after free
- BUG/MEDIUM: threads/time: fix time drift correction
- BUG/MEDIUM: threads/time: maintain a common time reference between all threads
- MINOR: sample: Add "thread" sample fetch
- BUG/MINOR: Use crt_base instead of ca_base when crt is parsed on a server line
- BUG/MINOR: stream: fix tv_request calculation for applets
- BUG/MAJOR: h2: always remove a stream from the send list before freeing it
- BUG/MAJOR: threads/task: dequeue expired tasks under the WQ lock
- MINOR: ssl: Handle reading early data after writing better.
- MINOR: mux: Make sure every string is woken up after the handshake.
- MEDIUM: cache: store sha1 for hashing the cache key
- MINOR: http: implement the "http-request reject" rule
- MINOR: h2: send RST_STREAM before GOAWAY on reject
- MEDIUM: h2: don't gracefully close the connection anymore on Connection: close
- MINOR: h2: make use of client-fin timeout after GOAWAY
- MEDIUM: config: ensure that tune.bufsize is at least 16384 when using HTTP/2
- MINOR: ssl: Handle early data with BoringSSL
- BUG/MEDIUM: stream: always release the stream-interface on abort
- BUG/MEDIUM: cache: free ressources in chn_end_analyze
- MINOR: cache: move the refcount decrease in the applet release
- BUG/MINOR: listener: Allow multiple "process" options on "bind" lines
- MINOR: config: Support a range to specify processes in "cpu-map" parameter
- MINOR: config: Slightly change how parse_process_number works
- MINOR: config: Export parse_process_number and use it wherever it's applicable
- MINOR: standard: Add my_ffsl function to get the position of the bit set to one
- MINOR: config: Add auto-increment feature for cpu-map
- MINOR: config: Support partial ranges in cpu-map directive
- MINOR:: config: Remove thread-map directive
- MINOR: config: Add the threads support in cpu-map directive
- MINOR: config: Add threads support for "process" option on "bind" lines
- MEDIUM: listener: Bind listeners on a thread subset if specified
- CLEANUP: debug: Use DPRINTF instead of fprintf into #ifdef DEBUG_FULL/#endif
- CLEANUP: log: Rename Alert/Warning in ha_alert/ha_warning
- MINOR/CLEANUP: proxy: rename "proxy" to "proxies_list"
- CLEANUP: pools: rename all pool functions and pointers to remove this "2"
- DOC: update the roadmap file with the latest changes merged in 1.8
- DOC: fix mangled version in peers protocol documentation
- DOC: add initial peers protovol v2.0 documentation.
- DOC: mention William as maintainer of the cache and master-worker
- DOC: add Christopher and Emeric as maintainers of the threads
- MINOR: cache: replace a fprint() by an abort()
- MEDIUM: cache: max-age configuration keyword
- DOC: explain HTTP2 timeout behavior
- DOC: cache: configuration and management
- MAJOR: mworker: exits the master on failure
- BUG/MINOR: threads: don't drop "extern" on the lock in include files
- MINOR: task: keep a pointer to the currently running task
- MINOR: task: align the rq and wq locks
- MINOR: fd: cache-align fdtab and fdcache locks
- MINOR: buffers: cache-align buffer_wq_lock
- CLEANUP: server: reorder some fields in struct server to save 40 bytes
- CLEANUP: proxy: slightly reorder the struct proxy to reduce holes
- CLEANUP: checks: remove 16 bytes of holes in struct check
- CLEANUP: cache: more efficiently pack the struct cache
- CLEANUP: fd: place the lock at the beginning of struct fdtab
- CLEANUP: pools: align pools on a cache line
- DOC: config: add a few bits about how to configure HTTP/2
- BUG/MAJOR: threads/queue: avoid recursive locking in pendconn_get_next_strm()
- BUILD: Makefile: reorder object files by size
pendconn_get_next_strm() is called from process_srv_queue() under the
server lock, and calls stream_add_srv_conn() with this lock held, while
the latter tries to take it again. This results in a deadlock when
a server's maxconn is reached and haproxy is built with thread support.
There are just a few pools, and they're stressed a lot, so it makes
sense to dedicate them a cache line to avoid contention and to place
the lock at the beginning.
The struct is not cache line aligned but at least, every time the lock
will appear in the same cache line as the fd it will benefit from being
accessed first. This improves the performance by about 2% on fd-intensive
workloads with 4 threads.
Commit 9dcf9b6 ("MINOR: threads: Use __decl_hathreads to declare locks")
accidently lost a few "extern" in certain lock declarations, possibly
causing certain entries to be declared at multiple places. Apparently
it hasn't caused any harm though.
The offending ones were :
- fdtab_lock
- fdcache_lock
- poll_lock
- buffer_wq_lock
This patch changes the behavior of the master during the exit of a
worker.
When a worker exits with an error code, for example in the case of a
segfault, all workers are now killed and the master leaves.
If you don't want this behavior you can use the option
"master-worker no-exit-on-failure".
During the migration to the second version of the pools, the new
functions and pool pointers were all called "pool_something2()" and
"pool2_something". Now there's no more pool v1 code and it's a real
pain to still have to deal with this. Let's clean this up now by
removing the "2" everywhere, and by renaming the pool heads
"pool_head_something".
Rename the global variable "proxy" to "proxies_list".
There's been multiple proxies in haproxy for quite some time, and "proxy"
is a potential source of bugs, a number of functions have a "proxy" argument,
and some code used "proxy" when it really meant "px" or "curproxy". It worked
by pure luck, because it usually happened while parsing the config, and thus
"proxy" pointed to the currently parsed proxy, but we should probably not
rely on this.
[wt: some of these are definitely fixes that are worth backporting]
It is now possible on a "bind" line (or a "stats socket" line) to specify the
thread set allowed to process listener's connections. For instance:
# HTTPS connections will be processed by all threads but the first and HTTP
# connection will be processed on the first thread.
bind *:80 process 1/1
bind *:443 ssl crt mycert.pem process 1/2-
Now, it is possible to bind CPU at the thread level instead of the process level
by defining a thread set in "cpu-map" directives. Thus, its format is now:
cpu-map [auto:]<process-set>[/<thread-set>] <cpu-set>...
where <process-set> and <thread-set> must follow the format:
all | odd | even | number[-[number]]
Having a process range and a thread range in same time with the "auto:" prefix
is not supported. Only one range is supported, the other one must be a fixed
number. But it is allowed when there is no "auto:" prefix.
Because it is possible to define a mapping for a process and another for a
thread on this process, threads will be bound on the intersection of their
mapping and the one of the process on which they are attached. If the
intersection is null, no specific binding will be set for the threads.
The prefix "auto:" can be added before the process set to let HAProxy
automatically bind a process to a CPU by incrementing process and CPU sets. To
be valid, both sets must have the same size. No matter the declaration order of
the CPU sets, it will be bound from the lower to the higher bound.
Examples:
# all these lines bind the process 1 to the cpu 0, the process 2 to cpu 1
# and so on.
cpu-map auto:1-4 0-3
cpu-map auto:1-4 0-1 2-3
cpu-map auto:1-4 3 2 1 0
# bind each process to exaclty one CPU using all/odd/even keyword
cpu-map auto:all 0-63
cpu-map auto:even 0-31
cpu-map auto:odd 32-63
# invalid cpu-map because process and CPU sets have different sizes.
cpu-map auto:1-4 0 # invalid
cpu-map auto:1 0-3 # invalid
The cache was relying on the txn->uri for creating its key, which was a
big problem when there was no log activated.
This patch does a sha1 of the host + uri, and stores it in the txn.
When a object is stored, the eb32node uses the first 32 bits of the hash
as a key, and the whole hash is stored in the cache entry.
During a lookup, the truncated hash is used, and when it matches an
entry we check the real sha1.
It can happen that we want to read early data, write some, and then continue
reading them.
To do so, we can't reuse tmp_early_data to store the amount of data sent,
so introduce a new member.
If we read early data, then ssl_sock_to_buf() is now the only responsible
for getting back to the handshake, to make sure we don't miss any early data.
This code has been used successfully a few times in the past to detect
that a pool was used after being freed. Its main goal is to allocate a
full page for each object so that they are always released individually
and unmapped from memory. This way if any part of the code reference the
object after is was freed and before it is reallocated, a segv occurs at
the exact offending location. It does a few extra things such as writing
to the memory area before freeing to detect double-frees and free of
read-only areas, and placing the data at the end of the page instead of
the beginning so that out of bounds accesses are easier to spot. The
amount of memory used with this is huge (about 10 times the regular
usage) but it can be useful sometimes.
Allows bigger objects to be cached in the shctx, the first
implementation was only storing small ssl session, but we want to store
bigger HTTP response.
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.
This commit modifies the headers decoding so that it now works in two
phases : hpack_decode_headers() only decodes the HPACK stream in the
HEADERS frame and puts the result into a list. Headers which require
storage (huffman-compressed or from the dynamic table) are stored in
a chunk allocated by the H2 demuxer. Then once the headers are properly
decoded into this list, h2_make_h1_request() is called with this list
to produce the HTTP/1.1 request into the destination buffer. The list
necessarily enforces a limit. Here we use 2*MAX_HTTP_HDR, which means
that we can have as many individual cookies as we have regular headers
if a client decides to break their cookies into multiple values. This
seams reasonable and will allow the H1 parser to decide whether it's
too much or not.
Thus the output stream is not produced on the fly anymore and this will
permit to deal with certain corner cases like reparing the Cookie header
(which for now is not done).
In order to limit header duplication and parsing, the known pseudo headers
continue to be passed by their index : the name element in the list then
has a NULL pointer and the value is the pseudo header's index. Given that
these ones represent about half of the incoming requests and need to be
found quickly, it maintains an acceptable level of performance.
The code was significantly reduced by doing this because the orignal code
had to deal with HPACK and H1 combinations (eg: index vs not indexed, etc)
and now the HPACK decoding is totally focused on the decompression, and
the H1 encoding doesn't have to deal with the issue of wrapping input for
example.
One bug was addressed here (though it couldn't happen at the moment). The
H2 demuxer used to detect a failure to write the request into the H1 buffer
and would then detect if the output buffer wraps, realign it and try again.
The problem by doing so was that the HPACK context was already modified and
not rewindable. Thus the size check is now performed first and a failure is
reported if it doesn't fit.
The current H2 to H1 protocol conversion presents some issues which will
require to perform some processing on certain headers before writing them
so it's not possible to convert HPACK to H1 on the fly.
Here we introduce a function which performs half of what hpack_decode_header()
used to do, which is to take a list of headers on input and emit the
corresponding request in HTTP/1.1 format. The code is the same and functions
were renamed to be prefixed with "h2" instead of "hpack", though it ends
up being simpler as the various HPACK-specific cases could be fused into
a single one (ie: add header).
Moving this part here makes a lot of sense as now this code is specific to
what is documented in HTTP/2 RFC 7540 and will be able to deal with special
cases related to H2 to H1 conversion enumerated in section 8.1.
Various error codes which were previously assigned to HPACK were never
used (aside being negative) and were all replaced by -1 with a comment
indicating what error was detected. The code could be further factored
thanks to this but this commit focuses on compatibility first.
This code is not yet used but builds fine.
While gcc only emits warnings about unused static functions, Clang also
emits such a warning when the functions are inlined. This is a bit
annoying at certain places where functions are provided to manipulate
multiple data types and are not yet used. Let's have a type modifier
"__maybe_unused" which sets the "unused" attribute like the Linux kernel
does. It's elegant as it allows the code author to indicate that it knows
that this element might be unused. It works on variables as well, which
is convenient to remove ifdefs around local variables in certain functions,
but doesn't work on labels.
[ plock commit 4c53fd3a0b2b1892817cebd0db012a52f4087850 ]
Pieter Baauw reported a build issue affecting haproxy after plock was
included. It happens that expressions of the form :
if ((const) ? (expr1) : (expr2))
do_something()
always produce code for both expr1 and expr2 on Clang when building
without optimization. The resulting asm code is even funny, basically
doing :
mov reg, 1
cmp reg, 1
...
This causes our sizeof() tests to fail to build because we purposely
dereference a fake function that reports the location and nature of the
inconsistency, but this fake function appears in the object code despite
all conditions being there to avoid it.
However the compiler is still smart enough to optimize away code doing
if (const)
do_something()
So we simply repeat the condition before do_something(), and the dummy
function is not referenced anymore unless really required.
[ plock commit 61e255286ae32e83e1a3174dd7c49eda99880a8b]
There are a few inlines such as pl_barrier() and pl_cpu_relax() which
are used a lot. Unfortunately, while building test code at -O0, inlining
is disabled and these ones are called a lot and show up a lot in any
profile, are traced into when single-stepping with a debugger, etc, thus
they are polluting the landscape. Since they're single-asm statements,
there is no reason for not turning them into macros.
The result becomes fairly visible here at -O0 :
$ size latency.inline latency.macro
text data bss dec hex filename
11431 692 656 12779 31eb treelock.inline
10967 692 656 12315 301b treelock.macro
And it was verified that regularly optimized code remains strictly identical.
[ plock commit 44081ea493dd78dab48076980e881748e9b33db5 ]
Older compilers (eg: gcc 3.4) don't provide __sync_synchronize() so let's
do it by hand on this platform.
[ plock commit b155d5c762fb9a9793911881f80e61faa6b0e889 ]
Local variables "l", "i" and "ret" were renamed "__pl_l", "__pl_i" and
"__pl_r" respectively, to limit the risk of conflicts with existing
variables in application code.
[ plock commit bfac5887ebabb8ef753b0351f162265767eb219b ]
Local variable "t" was renamed "__pl_t" to limit the risk of conflicts
with existing variables in application code.
This patch adds support for `Type=notify` to the systemd unit.
Supporting `Type=notify` improves both starting as well as reloading
of the unit, because systemd will be let known when the action completed.
See this quote from `systemd.service(5)`:
> Note however that reloading a daemon by sending a signal (as with the
> example line above) is usually not a good choice, because this is an
> asynchronous operation and hence not suitable to order reloads of
> multiple services against each other. It is strongly recommended to
> set ExecReload= to a command that not only triggers a configuration
> reload of the daemon, but also synchronously waits for it to complete.
By making systemd aware of a reload in progress it is able to wait until
the reload actually succeeded.
This patch introduces both a new `USE_SYSTEMD` build option which controls
including the sd-daemon library as well as a `-Ws` runtime option which
runs haproxy in master-worker mode with systemd support.
When haproxy is running in master-worker mode with systemd support it will
send status messages to systemd using `sd_notify(3)` in the following cases:
- The master process forked off the worker processes (READY=1)
- The master process entered the `mworker_reload()` function (RELOADING=1)
- The master process received the SIGUSR1 or SIGTERM signal (STOPPING=1)
Change the unit file to specify `Type=notify` and replace master-worker
mode (`-W`) with master-worker mode with systemd support (`-Ws`).
Future evolutions of this feature could include making use of the `STATUS`
feature of `sd_notify()` to send information about the number of active
connections to systemd. This would require bidirectional communication
between the master and the workers and thus is left for future work.
Instead of storing the SSL_SESSION pointer directly in the struct server,
store the ASN1 representation, otherwise, session resumption is broken with
TLS 1.3, when multiple outgoing connections want to use the same session.
a bitfield has been added to know if there are runnable applets for a
thread. When an applet is woken up, the bits corresponding to its thread_mask
are set. When all active applets for a thread is get to be processed, the thread
is removed from active ones by unsetting its tid_bit from the bitfield.