Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
First, we use atomic operations to update jobs/totalconn/actconn variables,
listener's nbconn variable and listener's counters. Then we add a lock on
listeners to protect access to their information. And finally, listener queues
(global and per proxy) are also protected by a lock. Here, because access to
these queues are unusal, we use the same lock for all queues instead of a global
one for the global queue and a lock per proxy for others.
2 global locks have been added to protect, respectively, the run queue and the
wait queue. And a process mask has been added on each task. Like for FDs, this
mask is used to know which threads are allowed to process a task.
For many tasks, all threads are granted. And this must be your first intension
when you create a new task, else you have a good reason to make a task sticky on
some threads. This is then the responsibility to the process callback to lock
what have to be locked in the task context.
Nevertheless, all tasks linked to a session must be sticky on the thread
creating the session. It is important that I/O handlers processing session FDs
and these tasks run on the same thread to avoid conflicts.
Many changes have been made to do so. First, the fd_updt array, where all
pending FDs for polling are stored, is now a thread-local array. Then 3 locks
have been added to protect, respectively, the fdtab array, the fd_cache array
and poll information. In addition, a lock for each entry in the fdtab array has
been added to protect all accesses to a specific FD or its information.
For pollers, according to the poller, the way to manage the concurrency is
different. There is a poller loop on each thread. So the set of monitored FDs
may need to be protected. epoll and kqueue are thread-safe per-se, so there few
things to do to protect these pollers. This is not possible with select and
poll, so there is no sharing between the threads. The poller on each thread is
independant from others.
Finally, per-thread init/deinit functions are used for each pollers and for FD
part for manage thread-local ressources.
Now, you must be carefull when a FD is created during the HAProxy startup. All
update on the FD state must be made in the threads context and never before
their creation. This is mandatory because fd_updt array is thread-local and
initialized only for threads. Because there is no pollers for the main one, this
array remains uninitialized in this context. For this reason, listeners are now
enabled in run_thread_poll_loop function, just like the worker pipe.
hap_register_per_thread_init and hap_register_per_thread_deinit functions has
been added to register functions to do, for each thread, respectively, some
initialization and deinitialization. These functions are added in the global
lists per_thread_init_list and per_thread_deinit_list.
These functions are called only when HAProxy is started with more than 1 thread
(global.nbthread > 1).
Email alerts relies on checks to send emails. The link between a mailers section
and a proxy was resolved during the configuration parsing, But initialization was
done when the first alert is triggered. This implied memory allocations and
tasks creations. With this patch, everything is now initialized during the
configuration parsing. So when an alert is triggered, only the memory required
by this alert is dynamically allocated.
Moreover, alerts processing had a flaw. The task handler used to process alerts
to be sent to the same mailer, process_email_alert, was designed to give back
the control to the scheduler when an alert was sent. So there was a delay
between the sending of 2 consecutives alerts (the min of
"proxy->timeout.connect" and "mailer->timeout.mail"). To fix this problem, now,
we try to process as much queued alerts as possible when the task is woken up.
This is a huge patch with many changes, all about the DNS. Initially, the idea
was to update the DNS part to ease the threads support integration. But quickly,
I started to refactor some parts. And after several iterations, it was
impossible for me to commit the different parts atomically. So, instead of
adding tens of patches, often reworking the same parts, it was easier to merge
all my changes in a uniq patch. Here are all changes made on the DNS.
First, the DNS initialization has been refactored. The DNS configuration parsing
remains untouched, in cfgparse.c. But all checks have been moved in a post-check
callback. In the function dns_finalize_config, for each resolvers, the
nameservers configuration is tested and the task used to manage DNS resolutions
is created. The links between the backend's servers and the resolvers are also
created at this step. Here no connection are kept alive. So there is no needs
anymore to reopen them after HAProxy fork. Connections used to send DNS queries
will be opened on demand.
Then, the way DNS requesters are linked to a DNS resolution has been
reworked. The resolution used by a requester is now referenced into the
dns_requester structure and the resolution pointers in server and dns_srvrq
structures have been removed. wait and curr list of requesters, for a DNS
resolution, have been replaced by a uniq list. And Finally, the way a requester
is removed from a DNS resolution has been simplified. Now everything is done in
dns_unlink_resolution.
srv_set_fqdn function has been simplified. Now, there is only 1 way to set the
server's FQDN, independently it is done by the CLI or when a SRV record is
resolved.
The static DNS resolutions pool has been replaced by a dynamoc pool. The part
has been modified by Baptiste Assmann.
The way the DNS resolutions are triggered by the task or by a health-check has
been totally refactored. Now, all timeouts are respected. Especially
hold.valid. The default frequency to wake up a resolvers is now configurable
using "timeout resolve" parameter.
Now, as documented, as long as invalid repsonses are received, we really wait
all name servers responses before retrying.
As far as possible, resources allocated during DNS configuration parsing are
releases when HAProxy is shutdown.
Beside all these changes, the code has been cleaned to ease code review and the
doc has been updated.
The messages processing is done using existing functions. So here, the main task
is to find the SPOE engine to use. To do so, we loop on all filter instances
attached to the stream. For each, we check if it is a SPOE filter and, if yes,
if its name is the one used to declare the "send-spoe-group" action.
We also take care to return an error if the action processing is interrupted by
HAProxy (because of a timeout or an error at the HAProxy level). This is done by
checking if the flag ACT_FLAG_FINAL is set.
The function spoe_send_group is the action_ptr callback ot
Because we can have messages chained by event or by group, we need to have a way
to know which kind of list we manipulate during the encoding. So 2 types of list
has been added, SPOE_MSGS_BY_EVENT and SPOE_MSGS_BY_GROUP. And the right type is
passed when spoe_encode_messages is called.
This action is used to trigger sending of a group of SPOE messages. To do so,
the SPOE engine used to send messages must be defined, as well as the SPOE group
to send. Of course, the SPOE engine must refer to an existing SPOE filter. If
not engine name is provided on the SPOE filter line, the SPOE agent name must be
used. For example:
http-request send-spoe-group my-engine some-group
This action is available for "tcp-request content", "tcp-response content",
"http-request" and "http-response" rulesets. It cannot be used for tcp
connection/session rulesets because actions for these rulesets cannot yield.
For now, the action keyword is parsed and checked. But it does nothing. Its
processing will be added in another patch.
For now, this section is only parsed. It should have the following format:
spoe-group <grp-name>
messages <msg-name> ...
And then SPOE groups must be referenced in spoe-agent section:
spoe-agnt <name>
...
groups <grp-name> ...
The purpose of these groups is to trigger messages sending from TCP or HTTP
rules, directly from HAProxy configuration, and not on specific event. This part
will be added in another patch.
It is important to note that a message belongs at most to a group.
The engine name is now kept in "spoe_config" struture. Because a SPOE filter can
be declared without engine name, we use the SPOE agent name by default. Then,
its uniqness is checked against all others SPOE engines configured for the same
proxy.
* TODO: Add documentation
Now, it is possible to conditionnaly send a SPOE message by adding an ACL-based
condition on the "event" line, in a "spoe-message" section. Here is the example
coming for the SPOE documentation:
spoe-message get-ip-reputation
args ip=src
event on-client-session if ! { src -f /etc/haproxy/whitelist.lst }
To avoid mixin with proxy's ACLs, each SPOE message has its private ACL list. It
possible to declare named ACLs in "spoe-message" section, using the same syntax
than for proxies. So we can rewrite the previous example to use a named ACL:
spoe-message get-ip-reputation
args ip=src
acl ip-whitelisted src -f /etc/haproxy/whitelist.lst
event on-client-session if ! ip-whitelisted
ACL-based conditions are executed in the context of the stream that handle the
client and the server connections.
It was painful not to have the status code available, especially when
it was computed. Let's store it and ensure we don't claim content-length
anymore on 1xx, only 0 body bytes.
This patch reorganize the shctx API in a generic storage API, separating
the shared SSL session handling from its core.
The shctx API only handles the generic data part, it does not know what
kind of data you use with it.
A shared_context is a storage structure allocated in a shared memory,
allowing its usage in a multithread or a multiprocess context.
The structure use 2 linked list, one containing the available blocks,
and another for the hot locked blocks. At initialization the available
list is filled with <maxblocks> blocks of size <blocksize>. An <extra>
space is initialized outside the list in case you need some specific
storage.
+-----------------------+--------+--------+--------+--------+----
| struct shared_context | extra | block1 | block2 | block3 | ...
+-----------------------+--------+--------+--------+--------+----
<-------- maxblocks --------->
* blocksize
The API allows to store content on several linked blocks. For example,
if you allocated blocks of 16 bytes, and you want to store an object of
60 bytes, the object will be allocated in a row of 4 blocks.
The API was made for LRU usage, each time you get an object, it pushes
the object at the end of the list. When it needs more space, it discards
The functions name have been renamed in a more logical way, the part
regarding shctx have been prefixed by shctx_ and the functions for the
shared ssl session cache have been prefixed by sh_ssl_sess_.
A bind_conf does contain a ssl_bind_conf, which already has a flag to know
if early data are activated, so use that, instead of adding a new flag in
the ssl_options field.
When compiled with Openssl >= 1.1.1, before attempting to do the handshake,
try to read any early data. If any early data is present, then we'll create
the session, read the data, and handle the request before we're doing the
handshake.
For this, we add a new connection flag, CO_FL_EARLY_SSL_HS, which is not
part of the CO_FL_HANDSHAKE set, allowing to proceed with a session even
before an SSL handshake is completed.
As early data do have security implication, we let the origin server know
the request comes from early data by adding the "Early-Data" header, as
specified in this draft from the HTTP working group :
https://datatracker.ietf.org/doc/html/draft-ietf-httpbis-replay
This patch simply brings HAProxy internal regex system to the Lua API.
Lua doesn't embed regexes, now it inherits from the regexes compiled
with haproxy.
In transport-layer functions (snd_buf/rcv_buf), it's very problematic
never to know if polling changes made to the connection will be propagated
or not. This has led to some conn_cond_update_polling() calls being placed
at a few places to cover both the cases where the function is called from
the upper layer and when it's called from the lower layer. With the arrival
of the MUX, this becomes even more complicated, as the upper layer will not
have to manipulate anything from the connection layer directly and will not
have to push such updates directly either. But the snd_buf functions will
need to see their updates committed when called from upper layers.
The solution here is to introduce a connection flag set by the connection
handler (and possibly any other similar place) indicating that the caller
is committed to applying such changes on return. This way, the called
functions will be able to apply such changes by themselves before leaving
when the flag is not set, and the upper layer will not have to care about
that anymore.
These flags are not exactly for the data layer, they instead indicate
what is expected from the transport layer. Since we're going to split
the connection between the transport and the data layers to insert a
mux layer, it's important to have a clear idea of what each layer does.
All function conn_data_* used to manipulate these flags were renamed to
conn_xprt_*.
Functions http_parse_chunk_size(), http_skip_chunk_crlf() and
http_forward_trailers() were moved to h1.h and h1.c respectively so
that they can be called from outside. The parts that were inline
remained inline as it's critical for performance (+41% perf
difference reported in an earlier test). For now the "http_" prefix
remains in their name since they still depend on the http_msg type.
Certain types and enums are very specific to the HTTP/1 parser, and we'll
need to share them with the HTTP/2 to HTTP/1 translation code. Let's move
them to h1.c/h1.h. Those with very few occurrences or only used locally
were renamed to explicitly mention the relevant HTTP version :
enum ht_state -> h1_state.
http_msg_state_str -> h1_msg_state_str
HTTP_FLG_* -> H1_FLG_*
http_char_classes -> h1_char_classes
Others like HTTP_IS_*, HTTP_MSG_* are left to be done later.
Fix regression introduced by commit:
'MAJOR: servers: propagate server status changes asynchronously.'
The building of the log line was re-worked to be done at the
postponed point without lack of data.
[wt: this only affects 1.8-dev, no backport needed]
In order to prepare multi-thread development, code was re-worked
to propagate changes asynchronoulsy.
Servers with pending status changes are registered in a list
and this one is processed and emptied only once 'run poll' loop.
Operational status changes are performed before administrative
status changes.
In a case of multiple operational status change or admin status
change in the same 'run poll' loop iteration, those changes are
merged to reach only the targeted status.
Commit bcb86ab ("MINOR: session: add a streams field to the session
struct") added this list of streams that is not needed anymore. Let's
get rid of it now.
After the removal of CO_FL_DATA_RD_SH and CO_FL_DATA_WR_SH, the
aggregate mask CO_FL_NOTIFY_DATA was not updated. It happens that
now CO_FL_NOTIFY_DATA and CO_FL_NOTIFY_DONE are similar, which may
reveal some overlap between the ->wake and ->xprt_done callbacks.
We'll see after the mux changes if both are still required.
Some places call delete_listener() then decrement the number of
listeners and jobs. At least one other place calls delete_listener()
without doing so, but since it's in deinit(), it's harmless and cannot
risk to cause zombie processes to survive. Given that the number of
listeners and jobs is incremented when creating the listeners, it's
much more logical to symmetrically decrement them when deleting such
listeners.
cfgparse has no business directly calling each individual protocol's 'add'
function to create a listener. Now that they're all registered, better
perform a protocol lookup on the family and have a standard ->add method
for all of them.
Adds cli commands to change at runtime whether informational messages
are prepended with severity level or not, with support for numeric and
worded severity in line with syslog severity level.
Adds stats socket config keyword severity-output to set default behavior
per socket on startup.
These notification management function and structs are generic and
it will be better to move in common parts.
The notification management functions and structs have names
containing some "lua" references because it was written for
the Lua. This patch removes also these references.
The server state and weight was reworked to handle
"pending" values updated by checks/CLI/LUA/agent.
These values are commited to be propagated to the
LB stack.
In further dev related to multi-thread, the commit
will be handled into a sync point.
Pending values are named using the prefix 'next_'
Current values used by the LB stack are named 'cur_'
swap_buffer is a global variable only used by buffer_slow_realign. So it has
been moved from global.h to buffer.c and it is allocated by init_buffer
function. deinit_buffer function has been added to release it. It is also used
to destroy the buffers' pool.
Patch "MINOR: ssl: support ssl-min-ver and ssl-max-ver with crt-list"
introduce ssl_methods in struct ssl_bind_conf. struct bind_conf have now
ssl_methods and ssl_conf.ssl_methods (unused). It's error-prone. This patch
remove the duplicate structure to avoid any confusion.
After careful inspection, this flag is set at exactly two places :
- once in the health-check receive callback after receipt of a
response
- once in the stream interface's shutw() code where CF_SHUTW is
always set on chn->flags
The flag was checked in the checks before deciding to send data, but
when it is set, the wake() callback immediately closes the connection
so the CO_FL_SOCK_WR_SH flag is also set.
The flag was also checked in si_conn_send(), but checking the channel's
flag instead is enough and even reveals that one check involving it
could never match.
So it's time to remove this flag and replace its check with a check of
CF_SHUTW in the stream interface. This way each layer is responsible
for its shutdown, this will ease insertion of the mux layer.
This flag is both confusing and wrong. It is supposed to report the
fact that the data layer has received a shutdown, but in fact this is
reported by CO_FL_SOCK_RD_SH which is set by the transport layer after
this condition is detected. The only case where the flag above is set
is in the stream interface where CF_SHUTR is also set on the receiving
channel.
In addition, it was checked in the health checks code (while never set)
and was always test jointly with CO_FL_SOCK_RD_SH everywhere, except in
conn_data_read0_pending() which incorrectly doesn't match the second
time it's called and is fortunately protected by an extra check on
(ic->flags & CF_SHUTR).
This patch gets rid of the flag completely. Now conn_data_read0_pending()
accurately reports the fact that the transport layer has detected the end
of the stream, regardless of the fact that this state was already consumed,
and the stream interface watches ic->flags&CF_SHUTR to know if the channel
was already closed by the upper layer (which it already used to do).
The now unused conn_data_read0() function was removed.
The session may need to enforce a timeout when waiting for a handshake.
Till now we used a trick to avoid allocating a pointer, we used to set
the connection's owner to the task and set the task's context to the
session, so that it was possible to circle between all of them. The
problem is that we'll really need to pass the pointer to the session
to the upper layers during initialization and that the only place to
store it is conn->owner, which is squatted for this trick.
So this patch moves the struct task* into the session where it should
always have been and ensures conn->owner points to the session until
the data layer is properly initialized.
Historically listeners used to have a handler depending on the upper
layer. But now it's exclusively process_stream() and nothing uses it
anymore so it can safely be removed.
The ->init() callback of the connection's data layer was only used to
complete the session's initialisation since sessions and streams were
split apart in 1.6. The problem is that it creates a big confusion in
the layers' roles as the session has to register a dummy data layer
when waiting for a handshake to complete, then hand it off to the
stream which will replace it.
The real need is to notify that the transport has finished initializing.
This should enable a better splitting between these layers.
This patch thus introduces a connection-specific callback called
xprt_done_cb() which informs about handshake successes or failures. With
this, data->init() can disappear, CO_FL_INIT_DATA as well, and we don't
need to register a dummy data->wake() callback to be notified of errors.
Till now connections used to rely exclusively on file descriptors. It
was planned in the past that alternative solutions would be implemented,
leading to member "union t" presenting sock.fd only for now.
With QUIC, the connection will need to continue to exist but will not
rely on a file descriptor but a connection ID.
So this patch introduces a "connection handle" which is either a file
descriptor or a connection ID, to replace the existing "union t". We've
now removed the intermediate "struct sock" which was never used. There
is no functional change at all, though the struct connection was inflated
by 32 bits on 64-bit platforms due to alignment.