This commit places calls to sockaddr_alloc() at the places where an address
is needed, and makes sure that the allocation is properly tested. This does
not add too many error paths since connection allocations are already in the
vicinity and share the same error paths. For the two cases where a
clear_addr() was called, instead the address was not allocated.
All reads were carefully reviewed for only reading already checked
values. Assignments were commented indicating that an allocation will
be needed once they become dynamic. The memset() used to clear the
addresses should then be turned to a free() and a NULL assignment.
The backend connect code uses conn_get_{from,to}_addr to forward addresses
in transparent mode and to map server ports, without really checking if the
operation succeeds. In preparation of future changes, let's switch to
conn_get_{src,dst}() and integrate status check for possible failures.
The old module proto_http does not exist anymore. All code dedicated to the HTTP
analysis is now grouped in the file proto_htx.c. So, to finish the polishing
after removing the legacy HTTP code, proto_htx.{c,h} files have been moved in
http_ana.{c,h} files.
In addition, all HTX analyzers and related functions prefixed with "htx_" have
been renamed to start with "http_" instead.
The L7 loadbalancing algorithms are concerned (uri, url_param and hdr), the
"sni" parameter on the server line and the "source" parameter on the server line
when used with "use_src hdr_ip()".
If si_connect() failed, do not try to install the mux nor to complete
the operations or add the connection to an idle list, and abort quickly
instead. No obvious side effects were identified, but continuing to
allocate some resources after something has already failed seems risky.
This was a result of a prior fix which already wanted to push this code
further : aa089d80b ("BUG/MEDIUM: server: Defer the mux init until after
xprt has been initialized.") but it ought to have pushed it even further
to maintain the error check just after si_connect().
To be backported to 2.0 and 1.9.
In connect_server(), if there were already a CS assosciated with the stream,
but we can't reuse it, because the target is different (because we tried a
previous connection, it failed, and we use redispatch so we switched servers),
don't forget to set srv_cs to NULL. Otherwise, if we end up reusing another
connection, we would consider we already have a conn_stream, and we won't
create a new one, so we'd have a new connection but we would not be able to
use it.
This can explain frozen streams and connections stuck in CLOSE_WAIT when
using redispatch.
This should be backported to 1.9 and 2.0.
In connect_server(), we used to only call xprt_add_hs() if CO_FL_SEND_PROXY
was set during the function call, we would not do it if the flag was set
before connect_server() was called. The rational at the time was if the flag
was already set, then the XPRT was already present. But now the xprt_handshake
always removes itself, so we have to re-add it each time, or it wouldn't be
done if the first connection attempt failed.
While I'm there, check any non-ssl handshake flag, instead of just
CO_FL_SEND_PROXY, or we'd miss the SOCKS4 flags.
This should be backported to 2.0.
In connect_server(), if we don't yet have a mux, because we're choosing
one depending on the ALPN, don't attempt to send early data. We can't do
it because those data would depend on the mux, that will only be determined
by the handshake.
This should be backported to 1.9.
In connect_server(), don't wait until we negociate the ALPN to choose the
mux, the only mux we want to use is the mux_pt anyway.
This should be backported to 1.9.
Add a new XPRT that is used when using non-SSL handshakes, such as proxy
protocol or Netscaler, instead of taking care of it in conn_fd_handler().
This XPRT is installed when any of those is used, and it removes itself once
the handshake is done.
This should allow us to remove the distinction between CO_FL_SOCK* and
CO_FL_XPRT*.
In connect_server(), when deciding if we should attempt to remove idle
connections, because we have to many file descriptors opened, don't attempt
to do so if idle connection pool is disabled (with pool-max-conn 0), as
if it is, srv->idle_orphan_conns won't even be allocated, and trying to
dereference it will cause a crash.
Have "socks4" and "check-via-socks4" server keyword added.
Implement handshake with SOCKS4 proxy server for tcp stream connection.
See issue #82.
I have the "SOCKS: A protocol for TCP proxy across firewalls" doc found
at "https://www.openssh.com/txt/socks4.protocol". Please reference to it.
[wt: for now connecting to the SOCKS4 proxy over unix sockets is not
supported, and mixing IPv4/IPv6 is discouraged; indeed, the control
layer is unique for a connection and will be used both for connecting
and for target address manipulation. As such it may for example report
incorrect destination addresses in logs if the proxy is reached over
IPv6]
Add session flags, and add a new flag, SESS_FL_PREFER_LAST, to be set when
we use NTLM authentication, and we should reuse the last connection. This
should fix using NTLM with HTX. This totally replaces TX_PREFER_LAST.
This should be backported to 1.9.
The first block is the start-line, if defined. Otherwise it the head of the HTX
message. So now, during HTTP analysis, lookup are all done using the first block
instead of the head. Concretely, for now, it is the same because only one HTTP
message is stored at a time in an HTX message. 1xx informational messages are
handled separatly from the final reponse and from each other. But it will make
sense when the 1xx informational messages and the associated final reponse will
be stored in the same HTX message.
Now, we only return the start-line. If not found, NULL is returned. No lookup is
performed and the HTX message is no more updated. It is now the caller
responsibility to update the position of the start-line to the right value. So
when it is not found, i.e sl_pos is set to -1, it means the last start-line has
been already processed and the next one has not been inserted yet.
It is mandatory to rely on this kind of warranty to store 1xx informational
responses and final reponse in the same HTX message.
Since commit 88698d9 ("MEDIUM: connections: Add a way to control the
number of idling connections.") when building without threads, gcc
complains that the operations made on the idle_orphan_conns[] list is
out of bounds, which is always false since 1) <i> can only equal zero,
and 2) given it's equal to <tid> we never even enter the loop. But as
usual it thinks it knows better, so let's mask the origin of this <i>
value to shut it up. Another solution consists in making <i> unsigned
and adding an explicit range check.
It's always a pain to have to stuff lots of #ifdef USE_OPENSSL around
ssl headers, it even results in some of them appearing in a random order
and multiple times just to benefit form an existing ifdef block. Let's
make these headers safe for inclusion when USE_OPENSSL is not defined,
they now perform the test themselves and do nothing if USE_OPENSSL is
not defined. This allows to remove no less than 8 such ifdef blocks
and make include blocks more readable.
They were all check to comply with the advertised openssl version. Now
that libressl doesn't pretend to be a more recent openssl anymore, we
can simply rely on the regular openssl version tests without having to
deal with exceptions for libressl.
Most tests on OPENSSL_VERSION_NUMBER have become complex and break all
the time because this number is fake for some derivatives like LibreSSL.
This patch creates a new macro, HA_OPENSSL_VERSION_NUMBER, which will
carry the real openssl version defining the compatibility level, and
this version will be adjusted depending on the variants.
Libressl doesn't yet provide early data, so don't put the CO_FL_EARLY_SSL_HS
on the connection if we're building with libressl, or the handshake will
never be done.
Add a new keyword for retry-on, 0rtt-rejected. If set, we will try to
replay requests for which we sent early data that got rejected by the
server.
If that option is set, we will attempt to use 0rtt if "allow-0rtt" is set
on the server line even if the client didn't send early data.
When for some reason the session is not the owner of the connection anymore,
make sure we remove CO_FL_SESS_IDLE, even if we're about to call
conn->mux->destroy(), as the destroy may not destroy the connection
immediately if it's still in use.
This should be backported to 1.9.
u
Don't know why it happens now, but gcc seems to think srv_conn may be NULL when
a reused connection is removed from the orphan list. It happens when HAProxy is
compiled with -O2 with my gcc (8.3.1) on fedora 29... Changing a little how
reuse parameter is tested removes the warnings. So...
This patch may be backported to 1.9.
As by default we add all keepalive connections to the idle pool, if we run
into a pathological case, where all client don't do keepalive, but the server
does, and haproxy is configured to only reuse "safe" connections, we will
soon find ourself having lots of idling, unusable for new sessions, connections,
while we won't have any file descriptors available to create new connections.
To fix this, add 2 new global settings, "pool_low_ratio" and "pool_high_ratio".
pool-low-fd-ratio is the % of fds we're allowed to use (against the maximum
number of fds available to haproxy) before we stop adding connections to the
idle pool, and destroy them instead. The default is 20. pool-high-fd-ratio is
the % of fds we're allowed to use (against the maximum number of fds available
to haproxy) before we start killing idling connection in the event we have to
create a new outgoing connection, and no reuse is possible. The default is 25.
It is mandatory to handle mux upgrades, because during a mux upgrade, the
connection will be reassigned to another multiplexer. So when the old one is
destroyed, it does not own the connection anymore. Or in other words, conn->ctx
does not point to the old mux's context when its destroy() callback is
called. So we now rely on the multiplexer context do destroy it instead of the
connection.
In addition, h1_release() and h2_release() have also been updated in the same
way.
Since LIST_DEL_LOCKED() and LIST_POP_LOCKED() now automatically reinitialize
the removed element, there's no need for keeping this LIST_INIT() call in the
idle connection code.
Instead of having one task per thread and per server that does clean the
idling connections, have only one global task for every servers.
That tasks parses all the servers that currently have idling connections,
and remove half of them, to put them in a per-thread list of connections
to kill. For each thread that does have connections to kill, wake a task
to do so, so that the cleaning will be done in the context of said thread.
Use the locked macros when manipulating idle_orphan_conns, so that other
threads can remove elements from it.
It will be useful later to avoid having a task per server and per thread to
cleanup the orphan list.
Add a per-thread counter of idling connections, and use it to determine
how many connections we should kill after the timeout, instead of using
the global counter, or we're likely to just kill most of the connections.
This should be backported to 1.9.
Use atomic operations when dealing with srv->curr_idle_conns, as it's shared
between threads, otherwise we could get inconsistencies.
This should be backported to 1.9.
A piece of code about the HTX was lost this summer, after the "big merge"
(htx/http2/connection layer refactoring). I forgot to keep HTX changes in the
functions connect_server() and assign_server(). So, this patch fixes "uri",
"url_param" and "hdr" LB algorithms when the HTX is enabled. It also fixes
evaluation of the "sni" expression on server lines.
This issue was reported on github. See issue #32.
This patch must be backported in 1.9.
Commit 3c4e19f42 ("BUG/MEDIUM: backend: always release the previous
connection into its own target srv_list") introduced a valid warning
about a null-deref risk since we didn't check conn_new()'s return value.
This patch must be backported to 1.9 with the patch above.
There was a bug reported in issue #19 regarding the fact that haproxy
could mis-route requests to the wrong server. It turns out that when
switching to another server, the old connection was put back into the
srv_list corresponding to the stream's target instead of this connection's
target. Thus if this connection was later picked, it was pointing to the
wrong server.
The patch fixes this and also clarifies the assignment to srv_conn->target
so that it's clear we don't change it when picking it from the srv_list.
This must be backported to 1.9 only.
If we failed to install the mux, just close the connection, or
conn_fd_handler() will be called for the FD, and crash as soon as it attempts
to access the mux' wake method.
This should be backported to 1.9.
If we failed to add the connection to the session, only try to add it back
to the server idle list if it has a mux, otherwise the connection is
incomplete and unusable.
This should be backported to 1.9.
In connect_server(), if we failed to add the connection to the session,
only destroy the conn_stream if we just allocated it, otherwise it may
have been allocated outside connect_server(), along with a connection which
has its destination address set.
Also use si_release_endpoint() instead of cs_destroy(), to make sure the
stream_interface doesn't reference it anymore.
This should be backported to 1.9.
In case an asynchronous connection (ALPN) succeeds but the mux fails to
attach, we must release the stream interface's endpoint, otherwise we
leave the stream interface with an endpoint pointing to a freed connection
with si_ops == si_conn_ops, and sess_update_st_cer() calls si_shutw() on
it, causing it to crash.
This must be backported to 1.9 only.
In connect_server(), if the previous connection failed, but had an alpn, no
mux was created, and thus the stream_interface's endpoint would be the
connection. In this case, instead of forgetting about it, and overriding
the stream_interface's endpoint later, try to reuse the connection, or the
connection will still be in the session's connection list, and will reference
to a stream that was probably destroyed.
This should be backported to 1.9.
The code dealing with idle connections used to check the number of streams
available on the connection only to unlink the connection from the idle
list. But this still resulted in too many streams reusing the same connection
when they were already attached to it.
We must detect that there is no more room and refrain from using this
connection at all, and instead fall back to the no-reuse case. Ideally
we should try to search among other idle connections, but for a backport
let's stay safe.
This must be backported to 1.9.
The current test consists in removing muxes which report that they're going
to assign their last available stream, but a mux may already be saturated
without having passed in this situation at all. This is what happens with
mux_h2 when receiving a GOAWAY frame informing the mux about the ID of the
last stream the other end is willing to process. The limit suddenly changes
from near infinite to 0. Currently what happens is that such a mux remains
in the idle list for a long time and refuses all new streams. Now at least
it will only fail a single stream in a retryable way. A future improvement
should consist in trying to pick another connection from the idle list.
This fix must be backported to 1.9.
If an ALPN is set on the server line, then when we reach assign_tproxy_address,
the stream_interface's endpoint will be a connection, not a conn_stream,
so make sure assign_tproxy_address() handles both cases.
This should be backported to 1.9.
When an argument <draws> is present, it must be an integer value one
or greater, indicating the number of draws before selecting the least
loaded of these servers. It was indeed demonstrated that picking the
least loaded of two servers is enough to significantly improve the
fairness of the algorithm, by always avoiding to pick the most loaded
server within a farm and getting rid of any bias that could be induced
by the unfair distribution of the consistent list. Higher values N will
take away N-1 of the highest loaded servers at the expense of performance.
With very high values, the algorithm will converge towards the leastconn's
result but much slower. The default value is 2, which generally shows very
good distribution and performance. This algorithm is also known as the
Power of Two Random Choices and is described here :
http://www.eecs.harvard.edu/~michaelm/postscripts/handbook2001.pdf
The algo-specific settings move from the proxy to the LB algo this way :
- uri_whole => arg_opt1
- uri_len_limit => arg_opt2
- uri_dirs_depth1 => arg_opt3
These ones used to rely on separate variables called hh_name/hh_len
but they are exclusive with the former. Let's use the same variable
which becomes a generic argument name and length for the LB algorithm.
There are a few instances where the lookup algo is tested against
BE_LB_LKUP_CHTREE using a binary "AND" operation while this macro
is a value among a set, and not a bit. The test happens to work
because the value is exactly 4 and no bit overlaps with the other
possible values but this is a latent bug waiting for a new LB algo
to appear to strike. At the moment the only other algo sharing a bit
with it is the "first" algo which is never supported in the same code
places.
This fix should be backported to maintained versions for safety if it
passes easily, otherwise it's not important as it will not fix any
visible issue.
The "balance uri" options "whole", "len" and "depth" were not properly
inherited from the defaults sections. In addition, "whole" and "len"
were not even reset when parsing "uri", meaning that 2 subsequent
"balance uri" statements would not have the expected effect as the
options from the first one would remain for the second one.
This may be backported to all maintained versions.
In connect_server(), if we're using a new connection, and we have to
initialize the mux right away, only do it so after si_connect() has been
called. si_connect() is responsible for initializing the xprt, and the
mux initialization may depend on the xprt being usable, as it may try to
receive data. Otherwise, the connection will be flagged as having an error,
and we will have to try to connect a second time.
This should be backported to 1.9.
Instead of keeping track of the number of connections we're responsible for,
keep track of the number of connections we're responsible for that we are
currently considering idling (ie that we are not using, they may be in use
by other sessions), that way we can actually reuse connections when we have
more connections than the max configured.
When connecting to a server, and reusing a connection, always attempt to give
the owner of the previous session one of its own connections, so that one
session won't be responsible for too many connections.
This should be backported to 1.9.
When creating a new outgoing connection, if we're using ALPN and waiting
for the handshake completion to choose the mux, and for some reason the
handshake failed, add the SI_FL_ERR flag to the stream_interface, so that
process_streams() knows the connection failed, and can attempt to retry,
instead of just hanging.
This should be backported to 1.9.
When a session adds a connection to its connection list, we used to remove
connections for an another server if there were not enough room for our
server. This can't work, because those lists are now the list of connections
we're responsible for, not just the idle connections.
To fix this, allow for an unlimited number of servers, instead of using
an array, we're now using a linked list.
To access the first element of the list, correctly use LIST_ELEM(), or we
end up getting the head of the list, instead of getting the first connection.
This should be backported to 1.9.
In connect_server(), if we looked for an usable connection and failed to
find one, srv_conn won't be NULL at the end of list_for_each_entry(), but
will point to the head of a list, which is not a pointer to a struct
connection, so explicitely set it to NULL.
This should be backported to 1.9.
If, for some reason we failed to allocate a conn_stream when reusing an
existing connection, set srv_conn to NULL, so that we fail later, instead
of pretending all is right. This ends up giving a stream_interface with
no endpoint, and so the stream will never end.
This should be backported to 1.9.
In connect_server(), don't attempt to reuse the old connection if it's
targetting a different server than the one we're supposed to access, or
we will never be able to connect to a server if the first one we tried failed.
This should be backported to 1.9.
There was a reference to struct stream in conn_free() for the case
where we're freeing a connection that doesn't have a mux attached.
For now we know it's always a stream, and we only need to do it to
put a NULL in s->si[1].end.
Let's do it better by storing the pointer to si[1].end in the context
and specifying that this pointer is always nulled if the mux is null.
This way it allows a connection to detach itself from wherever it's
being used. Maybe we could even get rid of the condition on the mux.
We most often store the mux context there but it can also be something
else while setting up the connection. Better call it "ctx" and know
that it's the owner's context than misleadingly call it mux_ctx and
get caught doing suspicious tricks.
Add the newly created to the idle list as long as http-reuse != never, and
when completing a H2 request, add the connection to the safe list instead of
the idle list, if we have to add it at that point, that means we created
many streams so we know it's safe.
In session, don't keep an infinite number of connection that can idle.
Add a new frontend parameter, "max-session-srv-conns" to set a max number,
with a default value of 5.
Instead of trying to get the session from the connection, which is not
always there, and of course there could be multiple sessions per connection,
provide it with the init() and attach() methods, so that we know the
session for each outgoing stream.
Add a new command, "pool-max-conn" that sets the maximum number of connections
waiting in the orphan idling connections list (as activated with idle-timeout).
Using "-1" means unlimited. Using pools is now dependant on this.
PiBa-NL reported that since this commit f157384 ("MINOR: backend: count
the number of connect and reuse per server and per backend"), reg-test
connection/h00001 fails. Indeed it does, the server is not checked for
existing prior to updating its counter. It should also fail with
transparent mode.
Sadly we didn't have the cumulated number of connections established to
servers till now, so let's now update it per backend and per-server and
report it in the stats. On the stats page it appears in the tooltip
when hovering over the total sessions count field.
Add a new method to mux, "reset", that is used to let the mux know the
connection attempt failed, and we're about to retry, so it just have to
reinit itself. Currently only the H1 mux needs it.
CS_FL_EOS | CS_FL_REOS can be set by the mux if the connection failed, so make
sure we remove them before retrying to connect, or it may lead to a premature
close of the connection.
In connect_server(), don't attempt to reuse the conn_stream associated to
the stream_interface, if we already attempted a connection with it.
Using that conn_stream is only there for the cases where a connection and
a conn_stream was created ahead, mostly by http_proxy or by the LUA code.
If we already attempted to connect, that means we fail, and so we should
create a new connection.
No backport needed.
In connect_server(), if we already have a conn_stream, reuse it
instead of trying to create a new one. http_proxy and LUA both
manually create a conn_stream and a connection, and we want
to use it.
Add a new keyword for servers, "idle-timeout". If set, unused connections are
kept alive until the timeout happens, and will be picked for reuse if no
other connection is available.
http_proxy is special, because it creates its connection and conn_stream
earlier. So in assign_server(), check that the connection associated with
the conn_stream has a destination address set, and in connect_server(),
use the connection and the conn_stream already attached to the
stream_interface, instead of looking for a connection in the session, and
creating a new conn_stream.
Instead of parsing all the available connections owned by the session
each time we choose a server, even if prefer-last-server is not set,
just do it if prefer-last-server is used, and check if the server is usable,
before checking the connections.
Instead of just storing the last connection in the session, store all of
the connections, for at most MAX_SRV_LIST (currently 5) targets.
That way we can do keepalive on more than 1 outgoing connection when the
client uses HTTP/2.
When creating a new outgoing H2 connection, put it in the idle list so that
it's immediately available for others to use, if http-reuse always is used.
When we're deferring the mux choice until the ALPN is negociated, we
attach the connection to the stream_interface until it's done, so that we
can destroy it if something goes wrong and the stream is destroy.
Before calling si_attach_cs() to attach the conn_stream once we have it,
call si_detach_endpoint(), or is_attach_cs() would destroy the connection.
When we defer the mux choice until the ALPN is negociated, don't forget
to wake the stream once it's done, or it will never have the opportunity
to send data.
This switches explicit calls to various trivial registration methods for
keywords, muxes or protocols from constructors to INITCALL1 at stage
STG_REGISTER. All these calls have in common to consume a single pointer
and return void. Doing this removes 26 constructors. The following calls
were addressed :
- acl_register_keywords
- bind_register_keywords
- cfg_register_keywords
- cli_register_kw
- flt_register_keywords
- http_req_keywords_register
- http_res_keywords_register
- protocol_register
- register_mux_proto
- sample_register_convs
- sample_register_fetches
- srv_register_keywords
- tcp_req_conn_keywords_register
- tcp_req_cont_keywords_register
- tcp_req_sess_keywords_register
- tcp_res_cont_keywords_register
- flt_register_keywords
In commit c7566001 ("MINOR: server: Add "alpn" and "npn" keywords") and
commit 201b9f4e ("MAJOR: connections: Defer mux creation for outgoing
connection if alpn is set"), the build was broken on older OpenSSL
releases.
Move the #ifdef's around so that we build again with older OpenSSL
releases (0.9.8 was tested).
When we create a connection, if we have to defer the conn_stream and the
mux creation until we can decide it (ie until the SSL handshake is done, and
the ALPN is decided), store the connection in the stream_interface, so that
we're sure we can destroy it if needed.
The creation of the conn_stream for an outgoing connection has been delayed
a bit, and when using dispatch, a check was made to see if a conn_stream
was attached before the conn_stream was created, so remove the test, as
it's done later anyway, and create and install the conn_stream right away
when we don't have a server, as is done when we don't have an alpn/npn
defined.
If an ALPN (or a NPN) was chosen for a server, defer choosing the mux until
after the SSL handshake is done, and the ALPN/NPN has been negociated, so
that we know which mux to pick.
Do not destroy the connection when we're about to destroy a stream. This
prevents us from doing keepalive on server connections when the client is
using HTTP/2, as a new stream is created for each request.
Instead, the session is now responsible for destroying connections.
When reusing connections, the attach() mux method is now used to create a new
conn_stream.
Instead of trying to receive as soon as the connection is created, and to
eventually have to transfer subscription if we move connections, wait
until the connection is established before attempting to recv.
Commit 85b73e9 ("BUG/MEDIUM: stream: Make sure polling is right on retry.")
introduced a possible null dereference on the error path detected by gcc-7.
Let's simply assign srv_conn after checking the error and not before.
No backport is needed.
While "option prefer-last-server" only applies to non-deterministic load
balancing algorithms, 401/407 responses actually caused haproxy to prefer
the last server unconditionally.
As this breaks deterministic load balancing algorithms like uri, this
patch applies the same condition here.
Should be backported to 1.8 (together with "BUG/MINOR: only mark
connections private if NTLM is detected").
When retrying to connect to a server, because the previous connection failed,
make sure if we subscribed to the previous connection, the polling flags will
be true for the new fd.
No backport is needed.
The return value from conn_install_mux() was not checked, so if an
inconsistency happens in the code, or a memory allocation fails while
initializing the mux, we can crash while using an uninitialized mux.
In practice the code inconsistency does not really happen since we
cannot configure such a situation, except during development, but
the out of memory condition could definitely happen.
This should be backported to 1.8 (the code is a bit different there,
there are two calls to conn_install_mux()).
This adds the sample fetch 'be_conn_free([<backend>])'. This sample fetch
provides the total number of unused connections across available servers in the
specified backend.
To do so, mux choices are split to handle incoming and outgoing connections in a
different way. The protocol specified on the bind/server line is used in
priority. Then, for frontend connections, the ALPN is retrieved and used to
choose the best mux. For backend connection, there is no ALPN. Finaly, if no
protocol is specified and no protocol matches the ALPN, we fall back on a
default mux, choosing in priority the first mux with exactly the same mode.
The comment above the change remains true. We assume there is always 1
conn_stream per outgoing connectionq. Today, it is always true because H2 is not
supported yet for server connections.
It remained some fragments of the old buffers API in debug messages, here and
there.
This was caused by the recent buffer API changes, no backport is needed.
Chunks are only a subset of a buffer (a non-wrapping version with no head
offset). Despite this we still carry a lot of duplicated code between
buffers and chunks. Replacing chunks with buffers would significantly
reduce the maintenance efforts. This first patch renames the chunk's
fields to match the name and types used by struct buffers, with the goal
of isolating the code changes from the declaration changes.
Most of the changes were made with spatch using this coccinelle script :
@rule_d1@
typedef chunk;
struct chunk chunk;
@@
- chunk.str
+ chunk.area
@rule_d2@
typedef chunk;
struct chunk chunk;
@@
- chunk.len
+ chunk.data
@rule_i1@
typedef chunk;
struct chunk *chunk;
@@
- chunk->str
+ chunk->area
@rule_i2@
typedef chunk;
struct chunk *chunk;
@@
- chunk->len
+ chunk->data
Some minor updates to 3 http functions had to be performed to take size_t
ints instead of ints in order to match the unsigned length here.
Now the buffers only contain the header and a pointer to the storage
area which can be anywhere. This will significantly simplify buffer
swapping and will make it possible to map chunks on buffers as well.
The buf_empty variable was removed, as now it's enough to have size==0
and area==NULL to designate the empty buffer (thus a non-allocated head
is the empty buffer by default). buf_wanted for now is indicated by
size==0 and area==(void *)1.
The channels and the checks now embed the buffer's head, and the only
pointer is to the storage area. This slightly increases the unallocated
buffer size (3 extra ints for the empty buffer) but considerably
simplifies dynamic buffer management. It will also later permit to
detach unused checks.
The way the struct buffer is arranged has proven quite efficient on a
number of tests, which makes sense given that size is always accessed
and often first, followed by the othe ones.
These ones manipulate the output data count which will be specific to
the channel soon, so prepare the call points to use the channel only.
The b_* functions are now unused and were removed.
For large farms where servers are regularly added or removed, picking
a random server from the pool can ensure faster load transitions than
when using round-robin and less traffic surges on the newly added
servers than when using leastconn.
This commit introduces "balance random". It internally uses a random as
the key to the consistent hashing mechanism, thus all features available
in consistent hashing such as weights and bounded load via hash-balance-
factor are usable. It is extremely convenient because one common concern
when using random is what happens when a server is hammered a bit too
much. Here that can trivially be avoided, like in the configuration below :
backend bk0
balance random
hash-balance-factor 110
server-template s 1-100 127.0.0.1:8000 check inter 1s
Note that while "balance random" internally relies on a hash algorithm,
it holds the same properties as round-robin and as such is compatible with
reusing an existing server connection with "option prefer-last-server".
Commit 522eea7 ("MINOR: ssl: Handle sending early data to server.") added
a dependency on SRV_SSL_O_EARLY_DATA which only exists when USE_OPENSSL
is defined (which is probably not the best solution) and breaks the build
when ssl is not enabled. Just add an ifdef USE_OPENSSL around the block
for now.
This adds a new keyword on the "server" line, "allow-0rtt", if set, we'll try
to send early data to the server, as long as the client sent early data, as
in case the server rejects the early data, we no longer have them, and can't
resend them, so the only option we have is to send back a 425, and we need
to be sure the client knows how to interpret it correctly.
All the references to connections in the data path from streams and
stream_interfaces were changed to use conn_streams. Most functions named
"something_conn" were renamed to "something_cs" for this. Sometimes the
connection still is what matters (eg during a connection establishment)
and were not always renamed. The change is significant and minimal at the
same time, and was quite thoroughly tested now. As of this patch, all
accesses to the connection from upper layers go through the pass-through
mux.
For HTTP/2 and QUIC, we'll need to deal with multiplexed streams inside
a connection. After quite a long brainstorming, it appears that the
connection interface to the existing streams is appropriate just like
the connection interface to the lower layers. In fact we need to have
the mux layer in the middle of the connection, between the transport
and the data layer.
A mux can exist on two directions/sides. On the inbound direction, it
instanciates new streams from incoming connections, while on the outbound
direction it muxes streams into outgoing connections. The difference is
visible on the mux->init() call : in one case, an upper context is already
known (outgoing connection), and in the other case, the upper context is
not yet known (incoming connection) and will have to be allocated by the
mux. The session doesn't have to create the new streams anymore, as this
is performed by the mux itself.
This patch introduces this and creates a pass-through mux called
"mux_pt" which is used for all new connections and which only
calls the data layer's recv,send,wake() calls. One incoming stream
is immediately created when init() is called on the inbound direction.
There should not be any visible impact.
Note that the connection's mux is purposely not set until the session
is completed so that we don't accidently run with the wrong mux. This
must not cause any issue as the xprt_done_cb function is always called
prior to using mux's recv/send functions.
A lock for LB parameters has been added inside the proxy structure and atomic
operations have been used to update server variables releated to lb.
The only significant change is about lb_map. Because the servers status are
updated in the sync-point, we can call recalc_server_map function synchronously
in map_set_server_status_up/down function.
For now, we have a list of each type per thread. So there is no need to lock
them. This is the easiest solution for now, but not the best one because there
is no sharing between threads. An idle connection on a thread will not be able
be used by a stream on another thread. So it could be a good idea to rework this
patch later.
Now, each proxy contains a lock that must be used when necessary to protect
it. Moreover, all proxy's counters are now updated using atomic operations.
srv_queue([<backend>/]<server>) : integer
Returns an integer value corresponding to the number of connections currently
pending in the designated server's queue. If <backend> is omitted, then the
server is looked up in the current backend. It can sometimes be used together
with the "use-server" directive to force to use a known faster server when it
is not much loaded. See also the "srv_conn", "avg_queue" and "queue" sample
fetch methods.
The server state and weight was reworked to handle
"pending" values updated by checks/CLI/LUA/agent.
These values are commited to be propagated to the
LB stack.
In further dev related to multi-thread, the commit
will be handled into a sync point.
Pending values are named using the prefix 'next_'
Current values used by the LB stack are named 'cur_'
2 places were using an open-coded implementation of this function to count
available servers. Note that the avg_queue_size() fetch didn't check that
the proxy was in STOPPED state so it would possibly return a wrong server
count here but that wouldn't impact the returned value.
Signed-off-by: Nenad Merdanovic <nmerdan@haproxy.com>
This is like the nbsrv() sample fetch function except that it works as
a converter so it can count the number of available servers of a backend
name retrieved using a sample fetch or an environment variable.
Signed-off-by: Nenad Merdanovic <nmerdan@haproxy.com>
Keeping the address and the port in the same field causes a lot of problems,
specifically on the DNS part where we're forced to cheat on the family to be
able to keep the port. This causes some issues such as some families not being
resolvable anymore.
This patch first moves the service port to a new field "svc_port" so that the
port field is never used anymore in the "addr" field (struct sockaddr_storage).
All call places were adapted (there aren't that many).
when using "option prefer-last-server", we may not always stay on
the same backend if option balance told us otherwise.
For example, backend may change in the following cases:
balance hdr()
balance rdp-cookie
balance source
balance uri
balance url_param
[wt: backport this to 1.7 and 1.6]
According to nbsrv() documentation this fetcher should return "an
integer value corresponding to the number of usable servers".
In case backend is disabled none of servers is usable, so I believe
fetcher should return 0.
This patch should be backported to 1.7, 1.6, 1.5.
Now we exclusively use xprt_get(XPRT_RAW) instead of &raw_sock or
xprt_get(XPRT_SSL) for &ssl_sock. This removes a bunch of #ifdef and
include spread over a number of location including backend, cfgparse,
checks, cli, hlua, log, server and session.
These 2 patches add ability to fetch frontend/backend name in your
logic, so they can be used later to make routing decisions (fe_name) or
taking some actions based on backend which responded to request (be_name).
In our case we needed a fetcher to be able to extract information we
needed from frontend name.
When a backend doesn't use any known LB algorithm, backend_lb_algo_str()
returns NULL. It used to cause "nil" to be printed in the stats dump
since version 1.4 but causes 1.7 to try to parse this NULL to encode
it as a CSV string, causing a crash on "show stat" in this case.
The only situation where this can happen is when "transparent" or
"dispatch" are used in a proxy, in which case the LB algorithm is
BE_LB_ALGO_NONE. Thus now we explicitly report "none" when this
situation is detected, and we preventively report "unknown" if any
unknown algorithm is detected, which may happen if such an algo is
added in the future and the function is not updated.
This fix must be backported to 1.7 and may be backported as far as
1.4, though it has less impact there.
We'll have to use srv_set_admin_flag() to propagate some server flags
during the startup, and we don't want the resulting actions to cause
warnings, logs nor e-mail alerts to be generated since we're just applying
the config or a state file. So let's condition these notifications to the
fact that we're starting.
For active servers, this is the sum of the eweights of all active
servers before this one in the backend, and
[srv->cumulative_weight .. srv_cumulative_weight + srv_eweight) is a
space occupied by this server in the range [0 .. lbprm.tot_wact), and
likewise for backup servers with tot_wbck. This allows choosing a
server or a range of servers proportional to their weight, by simple
integer comparison.
Signed-off-by: Andrew Rodland <andrewr@vimeo.com>
The "sni" server directive does some bad stuff on many occasions because
it works on a sample of type string and limits len to size-1 by hand. The
problem is that size used to be zero on many occasions before the recent
changes to smp_dup() and that it effectively results in setting len to -1
and writing the zero byte *before* the string (and not terminating the
string).
This patch makes use of the recently introduced smp_make_safe() to address
this issue.
This fix must be backported to 1.6.
Since commit 6879ad3 ("MEDIUM: sample: fill the struct sample with the
session, proxy and stream pointers") merged in 1.6-dev2, the sample
contains the pointer to the stream and sample fetch functions as well
as converters use it heavily.
The problem is that earlier commit 87b0966 ("REORG/MAJOR: session:
rename the "session" entity to "stream"") had split the session and
stream resulting in the possibility for smp->strm to be NULL before
the stream was initialized. This is what happens in tcp-request
connection rulesets, as discovered by Baptiste.
The sample fetch functions must now check that smp->strm is valid
before using it. An alternative could consist in using a dummy stream
with nothing in it to avoid some checks but it would only result in
deferring them to the next step anyway, and making it harder to detect
that a stream is valid or the dummy one.
There is still an issue with variables which requires a complete
independant fix. They use strm->sess to find the session with strm
possibly NULL and passed as an argument. All call places indirectly
use smp->strm to build strm. So the problem is there but the API needs
to be changed to remove this duplicate argument that makes it much
harder to know what pointer to use.
This fix must be backported to 1.6, as well as the next one fixing
variables.
There is a bug in connect_server() : we use si_attach_conn() to offer
the current session's connection to the session we're stealing the
connection from. Unfortunately, si_attach_conn() uses the standard data
connection operations while here we need to use the idle connection
operations.
This results in a situation where when the server's idle timeout strikes,
the read0 is silently ignored, causes the response channel to be shut down
for reads, and the connection remains attached. Next attempt to send a
request when using this connection simply results in nothing being done
because we try to send over an already closed connection. Worse, if the
client aborts, then no timeout remains at all and the session waits
forever and remains assigned to the server.
A more-or-less easy way to reproduce this bug is to have two concurrent
streams each connecting to a different server with "http-reuse aggressive",
typically a cache farm using a URL hash :
stream1: GET /1 HTTP/1.1
stream2: GET /2 HTTP/1.1
stream1: GET /2 HTTP/1.1
wait for the server 1's connection to timeout
stream2: GET /1 HTTP/1.1
The connection hangs here, and "show sess all" shows a closed connection
with a SHUTR on the response channel.
The fix is very simple though not optimal. It consists in calling
si_idle_conn() again after attaching the connection. But in practise
it should not be done like this. The real issue is that there's no way
to cleanly attach a connection to a stream interface without changing
the connection's operations. So the API clearly needs to be revisited
to make such operations easier.
Many thanks to Yves Lafon from W3C for providing lots of useful dumps
and testing patches to help figure the root cause!
This fix must be backported to 1.6.
This was the first transparent proxy technology supported by haproxy
circa 2005 but it was obsoleted in 2007 by Tproxy 4.0 which removed a
lot of the earlier versions' shortcomings and was finally merged into
the kernel. Since nobody has been using cttproxy for many years now
and nobody has even just tried to compile the files, it's time to
remove it. The doc was updated as well.
The union name "data" is a little bit heavy while we read the source
code because we can read "data.data.sint". The rename from "data" to "u"
makes the read easiest like "data.u.sint".
This patch remove the struct information stored both in the struct
sample_data and in the striuct sample. Now, only thestruct sample_data
contains data, and the struct sample use the struct sample_data for storing
his own data.
This strategy is less extreme than "always", it only dispatches first
requests to validated reused connections, and moves a connection from
the idle list to the safe list once it has seen a second request, thus
proving that it could be reused.
The "safe" mode consists in picking existing connections only when
processing a request that's not the first one from a connection. This
ensures that in case where the server finally times out and closes, the
client can decide to replay idempotent requests.
Now instead of closing the existing connection attached to the
stream interface, we first check if the one we pick was attached to
another stream interface, in which case the connections are swapped
if possible (eg: if the current connection is not private). That way
the previous connection remains attached to an existing session and
significantly increases the chances of being reused.
In connect_server(), if we don't have a connection attached to the
stream-int, we first look into the server's idle_conns list and we
pick the first one there, we detach it from its owner if it had one.
If we used to have a connection, we close it.
This mechanism works well but doesn't scale : as servers increase,
the likeliness that the connection attached to the stream interface
doesn't match the server and gets closed increases.
This flag is set on an outgoing connection when this connection gets
some properties that must not be shared with other connections, such
as dynamic transparent source binding, SNI or a proxy protocol header,
or an authentication challenge from the server. This will be needed
later to implement connection reuse.
Since we now always call this function with the reuse parameter cleared,
let's simplify the function's logic as it cannot return the existing
connection anymore. The savings on this inline function are appreciable
(240 bytes) :
$ size haproxy.old haproxy.new
text data bss dec hex filename
1020383 40816 36928 1098127 10c18f haproxy.old
1020143 40816 36928 1097887 10c09f haproxy.new
connect_server() already does most of the check that is done again in
si_alloc_conn(), so let's simply reuse the existing connection instead
of calling the function again. It will also simplify the connection
reuse.
Indeed, for reuse to be set, it also requires srv_conn to be valid. In the
end, the only situation where we have to release the existing connection
and allocate a new one is when reuse == 0.
objt_server() is called multiple times at various places while some
places already make use of srv for this. Let's move the call at the
top of the function and use it all over the place.
This patch removes the 32 bits unsigned integer and the 32 bit signed
integer. It replaces these types by a unique type 64 bit signed.
This makes easy the usage of integer and clarify signed and unsigned use.
With the previous version, signed and unsigned are used ones in place of
others, and sometimes the converter loose the sign. For example, divisions
are processed with "unsigned", if one entry is negative, the result is
wrong.
Note that the integer pattern matching and dotted version pattern matching
are already working with signed 64 bits integer values.
There is one user-visible change : the "uint()" and "sint()" sample fetch
functions which used to return a constant integer have been replaced with
a new more natural, unified "int()" function. These functions were only
introduced in the latest 1.6-dev2 so there's no impact on regular
deployments.
The new "sni" server directive takes a sample fetch expression and
uses its return value as a hostname sent as the TLS SNI extension.
A typical use case consists in forwarding the front connection's SNI
value to the server in a bridged HTTPS forwarder :
sni ssl_fc_sni
This patch removes the structs "session", "stream" and "proxy" from
the sample-fetches and converters function prototypes.
This permits to remove some weight in the prototype call.
The get_server_ph_post() function assumes that the buffer is contiguous.
While this is true for all the header part, it is not necessarily true
for the end of data the fit in the reserve. In this case there's a risk
to read past the end of the buffer for a few hundred bytes, and possibly
to crash the process if what follows is not mapped.
The fix consists in truncating the analyzed length to the length of the
contiguous block that follows the headers.
A config workaround for this bug would be to disable balance url_param.
This fix must be backported to 1.5. It seems 1.4 did have the check.
Many such function need a session, and till now they used to dereference
the stream. Once we remove the stream from the embryonic session, this
will not be possible anymore.
So as of now, sample fetch functions will be called with this :
- sess = NULL, strm = NULL : never
- sess = valid, strm = NULL : tcp-req connection
- sess = valid, strm = valid, strm->txn = NULL : tcp-req content
- sess = valid, strm = valid, strm->txn = valid : http-req / http-res
All of them can now retrieve the HTTP transaction *if it exists* from
the stream and be sure to get NULL there when called with an embryonic
session.
The patch is a bit large because many locations were touched (all fetch
functions had to have their prototype adjusted). The opportunity was
taken to also uniformize the call names (the stream is now always "strm"
instead of "l4") and to fix indent where it was broken. This way when
we later introduce the session here there will be less confusion.
Now this one is dynamically allocated. It means that 280 bytes of memory
are saved per TCP stream, but more importantly that it will become
possible to remove the l7 pointer from fetches and converters since
it will be deduced from the stream and will support being null.
A lot of care was taken because it's easy to forget a test somewhere,
and the previous code used to always trust s->txn for being valid, but
all places seem to have been visited.
All HTTP fetch functions check the txn first so we shouldn't have any
issue there even when called from TCP. When branching from a TCP frontend
to an HTTP backend, the txn is properly allocated at the same time as the
hdr_idx.
When s->si[0].end was dereferenced as a connection or anything in
order to retrieve information about the originating session, we'll
now use sess->origin instead so that when we have to chain multiple
streams in HTTP/2, we'll keep accessing the same origin.
Just like for the listener, the frontend is session-wide so let's move
it to the session. There are a lot of places which were changed but the
changes are minimal in fact.
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.
These 4 combinations are needlessly complicated since the session already
has direct access to the associated stream interfaces without having to
check an indirect pointer.
The purpose of these two macros will be to pass via the session to
find the relevant stream interfaces so that we don't need to store
the ->cons nor ->prod pointers anymore. Currently they're only defined
so that all references could be removed.
Note that many places need a second pass of clean up so that we don't
have any chn_prod(&s->req) anymore and only &s->si[0] instead, and
conversely for the 3 other cases.
The channels were pointers to outside structs and this is not needed
anymore since the buffers have moved, but this complicates operations.
Move them back into the session so that both channels and stream interfaces
are always allocated for a session. Some places (some early sample fetch
functions) used to validate that a channel was NULL prior to dereferencing
it. Now instead we check if chn->buf is NULL and we force it to remain NULL
until the channel is initialized.
When the destination IP is dynamically set, we can't use the "target"
to define the proto. This patch ensures that we always use the protocol
associated with the address family. The proto field was removed from
the server and check structs.
balance hdr(<name>) provides on option 'use_domain_only' to match only the
domain part in a header (designed for the Host header).
Olivier Fredj reported that the hashes were not the same for
'subdomain.domain.tld' and 'domain.tld'.
This is because the pointer was rewinded one step to far, resulting in a hash
calculated against wrong values :
- '.domai' for 'subdomain.domain.tld'
- ' domai' for 'domain.tld' (beginning with the space in the header line)
Another special case is when no dot can be found in the header : the hash will
be calculated against an empty string.
The patch addresses both cases : 'domain' will be used to compute the hash for
'subdomain.domain.tld', 'domain.tld' and 'domain' (using the whole header value
for the last case).
The fix must be backported to haproxy 1.5 and 1.4.
The path "MAJOR: namespace: add Linux network namespace support" doesn't
permit to use internal data producer like a "peers synchronisation"
system. The result is a segfault when the internal application starts.
This patch fix the commit b3e54fe387
It is introduced in 1.6dev version, it doesn't need to be backported.
This patch makes it possible to create binds and servers in separate
namespaces. This can be used to proxy between multiple completely independent
virtual networks (with possibly overlapping IP addresses) and a
non-namespace-aware proxy implementation that supports the proxy protocol (v2).
The setup is something like this:
net1 on VLAN 1 (namespace 1) -\
net2 on VLAN 2 (namespace 2) -- haproxy ==== proxy (namespace 0)
net3 on VLAN 3 (namespace 3) -/
The proxy is configured to make server connections through haproxy and sending
the expected source/target addresses to haproxy using the proxy protocol.
The network namespace setup on the haproxy node is something like this:
= 8< =
$ cat setup.sh
ip netns add 1
ip link add link eth1 type vlan id 1
ip link set eth1.1 netns 1
ip netns exec 1 ip addr add 192.168.91.2/24 dev eth1.1
ip netns exec 1 ip link set eth1.$id up
...
= 8< =
= 8< =
$ cat haproxy.cfg
frontend clients
bind 127.0.0.1:50022 namespace 1 transparent
default_backend scb
backend server
mode tcp
server server1 192.168.122.4:2222 namespace 2 send-proxy-v2
= 8< =
A bind line creates the listener in the specified namespace, and connections
originating from that listener also have their network namespace set to
that of the listener.
A server line either forces the connection to be made in a specified
namespace or may use the namespace from the client-side connection if that
was set.
For more documentation please read the documentation included in the patch
itself.
Signed-off-by: KOVACS Tamas <ktamas@balabit.com>
Signed-off-by: Sarkozi Laszlo <laszlo.sarkozi@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
Commit 98634f0 ("MEDIUM: backend: Enhance hash-type directive with an
algorithm options") cleaned up the hashing code by using a centralized
function. A bug appeared in get_server_uh() which is the URI hashing
function. Prior to the patch, the function would stop hashing on the
question mark, or on the trailing slash of a maximum directory count.
Consecutive to the patch, this last character is included into the
hash computation. This means that :
GET /0
GET /0?
Are not hashed similarly. The following configuration reproduces it :
mode http
balance uri
server s1 0.0.0.0:1234 redir /s1
server s2 0.0.0.0:1234 redir /s2
Many thanks to Vedran Furac for reporting this issue. The fix must
be backported to 1.5.
When we were generating a hash, it was done using an unsigned long. When the hash was used
to select a backend, it was sent as an unsigned int. This made it difficult to predict which
backend would be selected.
This patch updates get_hash, and the hash methods to use an unsigned int, to remain consistent
throughout the codebase.
This fix should be backported to 1.5 and probably in part to 1.4.
Checks.c has become a total mess. A number of proxy or server maintenance
and queue management functions were put there probably because they were
used there, but that makes the code untouchable. And that's without saying
that their names does not always relate to what they really do!
So let's do a first pass by moving these ones :
- set_backend_down() => backend.c
- redistribute_pending() => queue.c:pendconn_redistribute()
- check_for_pending() => queue.c:pendconn_grab_from_px()
- shutdown_sessions => server.c:srv_shutdown_sessions()
- shutdown_backup_sessions => server.c:srv_shutdown_backup_sessions()
All of them were moved at once.
Servers used to have 3 flags to store a state, now they have 4 states
instead. This avoids lots of confusion for the 4 remaining undefined
states.
The encoding from the previous to the new states can be represented
this way :
SRV_STF_RUNNING
| SRV_STF_GOINGDOWN
| | SRV_STF_WARMINGUP
| | |
0 x x SRV_ST_STOPPED
1 0 0 SRV_ST_RUNNING
1 0 1 SRV_ST_STARTING
1 1 x SRV_ST_STOPPING
Note that the case where all bits were set used to exist and was randomly
dealt with. For example, the task was not stopped, the throttle value was
still updated and reported in the stats and in the http_server_state header.
It was the same if the server was stopped by the agent or for maintenance.
It's worth noting that the internal function names are still quite confusing.
Now we introduce srv->admin and srv->prev_admin which are bitfields
containing one bit per source of administrative status (maintenance only
for now). For the sake of backwards compatibility we implement a single
source (ADMF_FMAINT) but the code already checks any source (ADMF_MAINT)
where the STF_MAINTAIN bit was previously checked. This will later allow
us to add ADMF_IMAINT for maintenance mode inherited from tracked servers.
Along doing these changes, it appeared that some places will need to be
revisited when implementing the inherited bit, this concerns all those
modifying the ADMF_FMAINT bit (enable/disable actions on the CLI or stats
page), and the checks to report "via" on the stats page. But currently
the code is harmless.
Till now, the server's state and flags were all saved as a single bit
field. It causes some difficulties because we'd like to have an enum
for the state and separate flags.
This commit starts by splitting them in two distinct fields. The first
one is srv->state (with its counter-part srv->prev_state) which are now
enums, but which still contain bits (SRV_STF_*).
The flags now lie in their own field (srv->flags).
The function srv_is_usable() was updated to use the enum as input, since
it already used to deal only with the state.
Note that currently, the maintenance mode is still in the state for
simplicity, but it must move as well.
We used to call srv_is_usable() with either the current state and weights
or the previous ones. This causes trouble for future changes, so let's first
split it in two variants :
- srv_is_usable(srv) considers the current status
- srv_was_usable(srv) considers the previous status
We used to have is_addr() in place to validate sometimes the existence
of an address, sometimes a valid IPv4 or IPv6 address. Replace them
carefully so that is_inet_addr() is used wherever we can only use an
IPv4/IPv6 address.
The RDP cookie extractor compares the 32-bit address from the request
to the address of each server in the farm without first checking that
the server's address is IPv4. This is a leftover from the IPv4 to IPv6
conversion. It's harmless as it's unlikely that IPv4 and IPv6 servers
will be mixed in an RDP farm, but better fix it.
This patch does not need to be backported.
This commit modifies the PROXY protocol V2 specification to support headers
longer than 255 bytes allowing for optional extensions. It implements the
PROXY protocol V2 which is a binary representation of V1. This will make
parsing more efficient for clients who will know in advance exactly how
many bytes to read. Also, it defines and implements some optional PROXY
protocol V2 extensions to send information about downstream SSL/TLS
connections. Support for PROXY protocol V1 remains unchanged.
Finn Arne Gangstad suggested that we should have the ability to break
keep-alive when the target server has reached its maxconn and that a
number of connections are present in the queue. After some discussion
around his proposed patch, the following solution was suggested : have
a per-proxy setting to fix a limit to the number of queued connections
on a server after which we break keep-alive. This ensures that even in
high latency networks where keep-alive is beneficial, we try to find a
different server.
This patch is partially based on his original proposal and implements
this configurable threshold.
All the code inherited from version 1.1 still holds a lot ot sessions
called "t" because in 1.1 they were tasks. This naming is very annoying
and sometimes even confusing, for example in code involving tables.
Let's get rid of this once for all and before 1.5-final.
Nothing changed beyond just carefully renaming these variables.
Cyril Bont reported that the "lastsess" field of a stats-only backend
was never updated. In fact the same is true for any applet and anything
not a server. Also, lastsess was not updated for a server reusing its
connection for a new request.
Since the goal of this field is to report recent activity, it's better
to ensure that all accesses are reported. The call has been moved to
the code validating the session establishment instead, since everything
passes there.
http_body_rewind() returns the number of bytes to rewind before buf->p to
find the message's body. It relies on http_hdr_rewind() to find the beginning
and adds msg->eoh + msg->eol which are always safe.
http_data_rewind() does the same to get the beginning of the data, which
differs from above when a chunk is present. It uses the function above and
adds msg->sol.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_uri_rewind() returns the number of bytes to rewind before buf->p to
find the URI. It relies on http_hdr_rewind() to find the beginning and
is just here to simplify operations.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_hdr_rewind() returns the number of bytes to rewind before buf->p to
find the beginning of headers. At the moment it's not exact as it still
relies on buf->o, assuming that no other data from a past message were
pending there, but it's what was done till there.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_body_bytes() returns the number of bytes of the current message body
present in the buffer. It is compatible with being called before and after
the headers are forwarded.
This is done to centralize further ->sov changes.
We used to have msg->sov updated for every chunk that was parsed. The issue
is that we want to be able to rewind after chunks were parsed in case we need
to redispatch a request and perform a new hash on the request or insert a
different server header name.
Currently, msg->sov and msg->next make parallel progress. We reached a point
where they're always equal because msg->next is initialized from msg->sov,
and is subtracted msg->sov's value each time msg->sov bytes are forwarded.
So we can now ensure that msg->sov can always be replaced by msg->next for
every state after HTTP_MSG_BODY where it is used as a position counter.
This allows us to keep msg->sov untouched whatever the number of chunks that
are parsed, as is needed to extract data from POST request (eg: url_param).
However, we still need to know the starting position of the data relative to
the body, which differs by the chunk size length. We use msg->sol for this
since it's now always zero and unused in the body.
So with this patch, we have the following situation :
- msg->sov = msg->eoh + msg->eol = size of the headers including last CRLF
- msg->sol = length of the chunk size if any. So msg->sov + msg->sol = DATA.
- msg->next corresponds to the byte being inspected based on the current
state and is always >= msg->sov before starting to forward anything.
Since sov and next are updated in case of header rewriting, a rewind will
fix them both when needed. Of course, ->sol has no reason for changing in
such conditions, so it's fine to keep it relative to msg->sov.
In theory, even if a redispatch has to be performed, a transformation
occurring on the request would still work because the data moved would
still appear at the same place relative to bug->p.
This is the continuation of previous patch. Now that full buffers are
not rejected anymore, let's wait for at least the advertised chunk or
body length to be present or the buffer to be full. When either
condition is met, the message processing can go forward.
Thus we don't need to use url_param_post_limit anymore, which was passed
in the configuration as an optionnal <max_wait> parameter after the
"check_post" value. This setting was necessary when the feature was
implemented because there was no support for parsing message bodies.
The argument is now silently ignored if set in the configuration.
Finn Arne Gangstad reported that commit 6b726adb35 ("MEDIUM: http: do
not report connection errors for second and further requests") breaks
support for serving static files by abusing the errorfile 503 statement.
Indeed, a second request over a connection sent to any server or backend
returning 503 would silently be dropped.
The proper solution consists in adding a flag on the session indicating
that the server connection was reused, and to only avoid the error code
in this case.
Since 1.5-dev20, we have a working server-side keep-alive and an option
"prefer-last-server" to indicate that we explicitly want to reuse the
same server as the last one. Unfortunately this breaks the redispatch
feature because assign_server() insists on reusing the same server as
the first one attempted even if the connection failed to establish.
A simple solution consists in only considering the last connection if
it was connected. Otherwise there is no reason for being interested in
reusing the same server.
Summary:
Track and report last session time on the stats page for each server
in every backend, as well as the backend.
This attempts to address the requirement in the ROADMAP
- add a last activity date for each server (req/resp) that will be
displayed in the stats. It will be useful with soft stop.
The stats page reports this as time elapsed since last session. This
change does not adequately address the requirement for long running
session (websocket, RDP... etc).
In HTTP keep-alive mode, if we receive a 401, we still have a chance
of being able to send the visitor again to the same server over the
same connection. This is required by some broken protocols such as
NTLM, and anyway whenever there is an opportunity for sending the
challenge to the proper place, it's better to do it (at least it
helps with debugging).
If we reuse a server-side connection, we must not reinitialize its context nor
try to enable send_proxy. At the moment HTTP keep-alive over SSL fails on the
first attempt because the SSL context was cleared, so it only worked after a
retry.
When the load balancing algorithm in use is not deterministic, and a previous
request was sent to a server to which haproxy still holds a connection, it is
sometimes desirable that subsequent requests on a same session go to the same
server as much as possible. Note that this is different from persistence, as
we only indicate a preference which haproxy tries to apply without any form
of warranty. The real use is for keep-alive connections sent to servers. When
this option is used, haproxy will try to reuse the same connection that is
attached to the server instead of rebalancing to another server, causing a
close of the connection. This can make sense for static file servers. It does
not make much sense to use this in combination with hashing algorithms.
This commit allows an existing server-side connection to be reused if
it matches the same target. Basic controls are performed ; right now
we do not allow to reuse a connection when dynamic source binding is
in use or when the destination address or port is dynamic (eg: proxy
mode). Later we'll have to also disable connection sharing when PROXY
protocol is being used or when non-idempotent requests are processed.
When allocating a new connection, only the caller knows whether it's
acceptable to reuse the previous one or not. Let's pass this information
to si_alloc_conn() which will do the cleanup if the connection is not
acceptable.
Having the check state partially stored in the server doesn't help.
Some functions such as srv_getinter() rely on the server being checked
to decide what check frequency to use, instead of relying on the check
being configured. So let's get rid of SRV_CHECKED and SRV_AGENT_CHECKED
and only use the check's states instead.
Till now the send_proxy_ofs field remained in the stream interface,
but since the dynamic allocation of the connection, it makes a lot
of sense to move that into the connection instead of the stream
interface, since it will not be statically allocated for each
session.
Also, it turns out that moving it to the connection fils an alignment
hole on 64 bit architectures so it does not consume more memory, and
removing it from the stream interface was an opportunity to correctly
reorder fields and reduce the stream interface's size from 160 to 144
bytes (-10%). This is 32 bytes saved per session.
The outgoing connection is now allocated dynamically upon the first attempt
to touch the connection's source or destination address. If this allocation
fails, we fail on SN_ERR_RESOURCE.
As we didn't use si->conn anymore, it was removed. The endpoints are released
upon session_free(), on the error path, and upon a new transaction. That way
we are able to carry the existing server's address across retries.
The stream interfaces are not initialized anymore before session_complete(),
so we could even think about allocating them dynamically as well, though
that would not provide much savings.
The session initialization now makes use of conn_new()/conn_free(). This
slightly simplifies the code and makes it more logical. The connection
initialization code is now shorter by about 120 bytes because it's done
at once, allowing the compiler to remove all redundant initializations.
The si_attach_applet() function now takes care of first detaching the
existing endpoint, and it is called from stream_int_register_handler(),
so we can safely remove the calls to si_release_endpoint() in the
application code around this call.
A call to si_detach() was made upon stream_int_unregister_handler() to
ensure we always free the allocated connection if one was allocated in
parallel to setting an applet (eg: detect HTTP proxy while proceeding
with stats maybe).
si_prepare_conn() is not appropriate in our case as it both initializes and
attaches the connection to the stream interface. Due to the asymmetry between
accept() and connect(), it causes some fields such as the control and transport
layers to be reinitialized.
Now that we can separately initialize these fields using conn_prepare(), let's
break this function to only attach the connection to the stream interface.
Also, by analogy, si_prepare_none() was renamed si_detach(), and
si_prepare_applet() was renamed si_attach_applet().
The connection will only remain there as a pre-allocated entity whose
goal is to be placed in ->end when establishing an outgoing connection.
All connection initialization can be made on this connection, but all
information retrieved should be applied to the end point only.
This change is huge because there were many users of si->conn. Now the
only users are those who initialize the new connection. The difficulty
appears in a few places such as backend.c, proto_http.c, peers.c where
si->conn is used to hold the connection's target address before assigning
the connection to the stream interface. This is why we have to keep
si->conn for now. A future improvement might consist in dynamically
allocating the connection when it is needed.
This function makes no sense anymore and will cause trouble to convert
the remains of connection/applet to end points. Let's replace it now
with its contents.
A very old bug resulting from some code refactoring causes
assign_server_address() to refrain from retrieving the destination
address from the client-side connection when transparent mode is
enabled and we're connecting to a server which has address 0.0.0.0.
The impact is low since such configurations are unlikely to ever
be encountered. The fix should be backported to older branches.
This function was designed for haproxy while testing other functions
in the past. Initially it was not planned to be used given the not
very interesting numbers it showed on real URL data : it is not as
smooth as the other ones. But later tests showed that the other ones
are extremely sensible to the server count and the type of input data,
especially DJB2 which must not be used on numeric input. So in fact
this function is still a generally average performer and it can make
sense to merge it in the end, as it can provide an alternative to
sdbm+avalanche or djb2+avalanche for consistent hashing or when hashing
on numeric data such as a source IP address or a visitor identifier in
a URL parameter.
Summary:
Avalanche is supported not as a native hashing choice, but a modifier
on the hashing function. Note that this means that possible configs
written after 1.5-dev4 using "hash-type avalanche" will get an informative
error instead. But as discussed on the mailing list it seems nobody ever
used it anyway, so let's fix it before the final 1.5 release.
The default values were selected for backward compatibility with previous
releases, as discussed on the mailing list, which means that the consistent
hashing will still apply the avalanche hash by default when no explicit
algorithm is specified.
Examples
(default) hash-type map-based
Map based hashing using sdbm without avalanche
(default) hash-type consistent
Consistent hashing using sdbm with avalanche
Additional Examples:
(a) hash-type map-based sdbm
Same as default for map-based above
(b) hash-type map-based sdbm avalanche
Map based hashing using sdbm with avalanche
(c) hash-type map-based djb2
Map based hashing using djb2 without avalanche
(d) hash-type map-based djb2 avalanche
Map based hashing using djb2 with avalanche
(e) hash-type consistent sdbm avalanche
Same as default for consistent above
(f) hash-type consistent sdbm
Consistent hashing using sdbm without avalanche
(g) hash-type consistent djb2
Consistent hashing using djb2 without avalanche
(h) hash-type consistent djb2 avalanche
Consistent hashing using djb2 with avalanche
Summary:
In testing at tumblr, we found that using djb2 hashing instead of the
default sdbm hashing resulted is better workload distribution to our backends.
This commit implements a change, that allows the user to specify the hash
function they want to use. It does not limit itself to consistent hashing
scenarios.
The supported hash functions are sdbm (default), and djb2.
For a discussion of the feature and analysis, see mailing list thread
"Consistent hashing alternative to sdbm" :
http://marc.info/?l=haproxy&m=138213693909219
Note: This change does NOT make changes to new features, for instance,
applying an avalance hashing always being performed before applying
consistent hashing.
This function is also called directly from backend.c, so let's stop
building fake args to call it as a sample fetch, and have a lower
layer more generic function instead.
We're having a lot of duplicate code just because of minor variants between
fetch functions that could be dealt with if the functions had the pointer to
the original keyword, so let's pass it as the last argument. An earlier
version used to pass a pointer to the sample_fetch element, but this is not
the best solution for two reasons :
- fetch functions will solely rely on the keyword string
- some other smp_fetch_* users do not have the pointer to the original
keyword and were forced to pass NULL.
So finally we're passing a pointer to the keyword as a const char *, which
perfectly fits the original purpose.
Benoit Dolez reported a failure to start haproxy 1.5-dev19. The
process would immediately report an internal error with missing
fetches from some crap instead of ACL names.
The cause is that some versions of gcc seem to trim static structs
containing a variable array when moving them to BSS, and only keep
the fixed size, which is just a list head for all ACL and sample
fetch keywords. This was confirmed at least with gcc 3.4.6. And we
can't move these structs to const because they contain a list element
which is needed to link all of them together during the parsing.
The bug indeed appeared with 1.5-dev19 because it's the first one
to have some empty ACL keyword lists.
One solution is to impose -fno-zero-initialized-in-bss to everyone
but this is not really nice. Another solution consists in ensuring
the struct is never empty so that it does not move there. The easy
solution consists in having a non-null list head since it's not yet
initialized.
A new "ILH" list head type was thus created for this purpose : create
an Initialized List Head so that gcc cannot move the struct to BSS.
This fixes the issue for this version of gcc and does not create any
burden for the declarations.
This patch does not change the logic of the code, it only changes the
way OS-specific defines are tested.
At the moment the transparent proxy code heavily depends on Linux-specific
defines. This first patch introduces a new define "CONFIG_HAP_TRANSPARENT"
which is set every time the defines used by transparent proxy are present.
This also means that with an up-to-date libc, it should not be necessary
anymore to force CONFIG_HAP_LINUX_TPROXY during the build, as the flags
will automatically be detected.
The CTTPROXY flags still remain separate because this older API doesn't
work the same way.
A new line has been added in the version output for haproxy -vv to indicate
what transparent proxy support is available.
Now that ACLs solely rely on sample fetch functions, make them use the
same arg mask. All inconsistencies have been fixed separately prior to
this patch, so this patch almost only adds a new pointer indirection
and removes all references to ARG*() in the definitions.
The parsing is still performed by the ACL code though.
ACL fetch functions used to directly reference a fetch function. Now
that all ACL fetches have their sample fetches equivalent, we can make
ACLs reference a sample fetch keyword instead.
In order to simplify the code, a sample keyword name may be NULL if it
is the same as the ACL's, which is the most common case.
A minor change appeared, http_auth always expects one argument though
the ACL allowed it to be missing and reported as such afterwards, so
fix the ACL to match this. This is not really a bug.
The following sample fetch functions were only usable by ACLs but are now
usable by sample fetches too :
avg_queue, be_conn, be_id, be_sess_rate, connslots, nbsrv,
queue, srv_conn, srv_id, srv_is_up, srv_sess_rate
The fetch functions have been renamed "smp_fetch_*".
The file acl.c is a real mess, it both contains functions to parse and
process ACLs, and some sample extraction functions which act on buffers.
Some other payload analysers were arbitrarily dispatched to proto_tcp.c.
So now we're moving all payload-based fetches and ACLs to payload.c
which is capable of extracting data from buffers and rely on everything
that is protocol-independant. That way we can safely inflate this file
and only use the other ones when some fetches are really specific (eg:
HTTP, SSL, ...).
As a result of this cleanup, the following new sample fetches became
available even if they're not really useful :
always_false, always_true, rep_ssl_hello_type, rdp_cookie_cnt,
req_len, req_ssl_hello_type, req_ssl_sni, req_ssl_ver, wait_end
The function 'acl_fetch_nothing' was wrong and never used anywhere so it
was removed.
The "rdp_cookie" sample fetch used to have a mandatory argument while it
was optional in ACLs, which are supposed to iterate over RDP cookies. So
we're making it optional as a fetch too, and it will return the first one.
This is just like previous commit, but for the backend this time. All this
code did not need to remain duplicated. These are 500 more bytes shaved off.
Both servers and proxies share a common set of parameters for outgoing
connections, and since they're not stored in a similar structure, a lot
of code is duplicated in the connection setup, which is one sensible
area.
Let's first define a common struct for these settings and make use of it.
Next patches will de-duplicate code.
This change also fixes a build breakage that happens when USE_LINUX_TPROXY
is not set but USE_CTTPROXY is set, which seem to be very unlikely
considering that the issue was introduced almost 2 years ago an never
reported.
Considering there is no option yet for maxconnrate for servers, I wrote
an ACL to check a backend server session rate which we use to send to an
"overflow" backend to prevent latency responses to our clients (very
sensitive latency requirements).
Instead of storing a couple of (int, ptr) in the struct connection
and the struct session, we use a different method : we only store a
pointer to an integer which is stored inside the target object and
which contains a unique type identifier. That way, the pointer allows
us to retrieve the object type (by dereferencing it) and the object's
address (by computing the displacement in the target structure). The
NULL pointer always corresponds to OBJ_TYPE_NONE.
This reduces the size of the connection and session structs. It also
simplifies target assignment and compare.
In order to improve the generated code, we try to put the obj_type
element at the beginning of all the structs (listener, server, proxy,
si_applet), so that the original and target pointers are always equal.
A lot of code was touched by massive replaces, but the changes are not
that important.
We will need to be able to switch server connections on a session and
to keep idle connections. In order to achieve this, the preliminary
requirement is that the connections can survive the session and be
detached from them.
Right now they're still allocated at exactly the same place, so when
there is a session, there are always 2 connections. We could soon
improve on this by allocating the outgoing connection only during a
connect().
This current patch touches a lot of code and intentionally does not
change any functionnality. Performance tests show no regression (even
a very minor improvement). The doc has not yet been updated.
With this commit, we now separate the channel from the buffer. This will
allow us to replace buffers on the fly without touching the channel. Since
nobody is supposed to keep a reference to a buffer anymore, doing so is not
a problem and will also permit some copy-less data manipulation.
Interestingly, these changes have shown a 2% performance increase on some
workloads, probably due to a better cache placement of data.
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge for all the handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests this. And
it is the upper layer which iterates over the transport layer's functions
to move data which should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
Alex Markham reported and diagnosed a bug appearing on 1.5-dev11,
causing a crash on x86_64 when header hashing is used. The cause is
a missing (int) cast causing a negative offset to appear positive
and the resulting pointer to go out of bounds.
The crash is not possible anymore since 1.5-dev12 because a second
bug caused the negative sign to disappear so the pointer is always
within range but always wrong, so balance hdr() never works anymore.
This fix restores the correct behaviour and ensures the sign is
correct.
The last uses of the stream interfaces were in tcp_connect_server() and
could easily and more appropriately be moved to its callers, si_connect()
and connect_server(), making a lot more sense.
Now the function should theorically be usable for health checks.
It also appears more obvious that the file is split into two distinct
parts :
- the protocol layer used at the connection level
- the tcp analysers executing tcp-* rules and their samples/acls.
We need to have the source and destination addresses in the connection.
They were lying in the stream interface so let's move them. The flags
SI_FL_FROM_SET and SI_FL_TO_SET have been moved as well.
It's worth noting that tcp_connect_server() almost does not use the
stream interface anymore except for a few flags.
It has been identified that once we detach the connection from the SI,
it will probably be needed to keep a copy of the server-side addresses
in the SI just for logging purposes. This has not been implemented right
now though.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cp as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.
At the moment, the struct is still embedded into the struct channel, but
all the functions have been updated to use struct buffer only when possible,
otherwise struct channel. Some functions would likely need to be splitted
between a buffer-layer primitive and a channel-layer function.
Later the buffer should become a pointer in the struct buffer, but doing so
requires a few changes to the buffer allocation calls.
This is a massive rename. We'll then split channel and buffer.
This change needs a lot of cleanups. At many locations, the parameter
or variable is still called "buf" which will become ambiguous. Also,
the "struct channel" is still defined in buffers.h.
This patch brings a new "whole" parameter to "balance uri" which makes
the hash work over the whole uri, not just the part before the query
string. Len and depth parameter are still honnored.
The reason for this new feature is explained below.
I have 3 backend servers, each accepting different form of HTTP queries:
http://backend1.server.tld/service1.php?q=...
http://backend1.server.tld/service2.php?q=...
http://backend2.server.tld/index.php?query=...&subquery=...
http://backend3.server.tld/image/49b8c0d9ff
Each backend server returns a different response based on either:
- the URI path (the left part of the URI before the question mark)
- the query string (the right part of the URI after the question mark)
- or the combination of both
I wanted to set up a common caching cluster (using 6 Squid servers, each
configured as reverse proxy for those 3 backends) and have HAProxy balance
the queries among the Squid servers based on URL. I also wanted to achieve
hight cache hit ration on each Squid server and send the same queries to
the same Squid servers. Initially I was considering using the 'balance uri'
algorithm, but that would not work as in case of backend2 all queries would
go to only one Squid server. The 'balance url_param' would not work either
as it would send the backend3 queries to only one Squid server.
So I thought the simplest solution would be to use 'balance uri', but to
calculate the hash based on the whole URI (URI path + query string),
instead of just the URI path.
We start to move everything needed to manage a connection to a special
entity "struct connection". We have the data layer operations and the
control operations there. We'll also have more info in the future such
as file descriptors and applet contexts, so that in the end it becomes
detachable from the stream interface, which will allow connections to
be reused between sessions.
For now on, we start with minimal changes.
Since the recent buffer reorg, msg->som is redundant with buf->p but still
appears at a number of places. This tiny patch allows to confirm that som
follows two states :
- 0 from the moment the message starts to be parsed
- relative offset to ->p for start of chunk when parsing chunks
During this second state, ->sol is never used, so we should probably merge
the two.
This is a left-over from the buffer changes. Msg->sol is always null at the
end of the parsing, so we must not use it anymore to read headers or find
the beginning of a message. As a side effect, the dump of the request in
debug mode is working again because it was relying on msg->sol not being
null.
Maybe it will even be mergeable with another of the message pointers.
The recent split between the buffers and HTTP messages in 1.5-dev9 caused
a major trouble : in the past, we used to keep a pointer to HTTP data in the
buffer struct itself, which was the cause of most of the pain we had to deal
with buffers.
Now the two are split but we lost the information about the beginning of
the HTTP message once it's being forwarded. While it seems normal, it happens
that several parts of the code currently rely on this ability to inspect a
buffer containing old contents :
- balance uri
- balance url_param
- balance url_param check_post
- balance hdr()
- balance rdp-cookie()
- http-send-name-header
All these happen after the data are scheduled for being forwarded, which
also causes a server to be selected. So for a long time we've been relying
on supposedly sent data that we still had a pointer to.
Now that we don't have such a pointer anymore, we only have one possibility :
when we need to inspect such data, we have to rewind the buffer so that ->p
points to where it previously was. We're lucky, no data can leave the buffer
before it's being connecting outside, and since no inspection can begin until
it's empty, we know that the skipped data are exactly ->o. So we rewind the
buffer by ->o to get headers and advance it back by the same amount.
Proceeding this way is particularly important when dealing with chunked-
encoded requests, because the ->som and ->sov fields may be reused by the
chunk parser before the connection attempt is made, so we cannot rely on
them.
Also, we need to be able to come back after retries and redispatches, which
might change the size of the request if http-send-name-header is set. All of
this is accounted for by the output queue so in the end it does not look like
a bad solution.
No backport is needed.
Instead of hard-coding sock_raw in connect_server(), we set this socket
operation at config parsing time. Right now, only servers and peers have
it. Proxies are still hard-coded as sock_raw. This will be needed for
future work on SSL which requires a different socket layer.
Commit e164e7a removed get_src/get_dst setting in the stream interfaces but
forgot to set it in proto_tcp. Get the feature back because we need it for
logging, transparent mode, ACLs etc... We now rely on the stream interface
direction to know what syscall to use.
One benefit of doing it this way is that we don't use getsockopt() anymore
on outgoing stream interfaces nor on UNIX sockets.
We'll soon have an SSL socket layer, and in order to ease the difference
between the two, we use the name "sock_raw" to designate the one which
directly talks to the sockets without any conversion.
Patterns were using a bitmask to indicate if request or response was desired
in fetch functions and keywords. ACLs were using a bitmask in fetch keywords
and a single bit in fetch functions. ACLs were also using an ACL_PARTIAL bit
in fetch functions indicating that a non-final fetch was performed, which was
an abuse of the existing direction flag.
The change now consists in using :
- a capabilities field for fetch keywords => SMP_CAP_REQ/RES to indicate
if a keyword supports requests, responses, both, etc...
- an option field for fetch functions to indicate what the caller expects
(request/response, final/non-final)
The ACL_PARTIAL bit was reversed to get SMP_OPT_FINAL as it's more explicit
to know we're working on a final buffer than on a non-final one.
ACL_DIR_* were removed, as well as PATTERN_FETCH_*. L4 fetches were improved
to support being called on responses too since they're still available.
The <dir> field of all fetch functions was changed to <opt> which is now
unsigned.
The patch is large but mostly made of cosmetic changes to accomodate this, as
almost no logic change happened.
Having the args everywhere will make it easier to share fetch functions
between patterns and ACLs. The only place where we could have needed
the expr was in the http_prefetch function which can do well without.
Previously, both pattern, backend and persist_rdp_cookie would build fake
ACL expressions to fetch an RDP cookie by calling acl_fetch_rdp_cookie().
Now we switch roles. The RDP cookie fetch function is provided as a sample
fetch function that all others rely on, including ACL. The code is exactly
the same, only the args handling moved from expr->args to args. The code
was moved to proto_tcp.c, but probably that a dedicated file would be more
suited to content handling.
These ones were either unused or improperly used. Some integers were marked
read-only, which does not make much sense. Buffers are not read-only, they're
"constant" in that they must be kept intact after any possible change.
This one is not needed anymore as we can return the data and its type in the
sample provided by the caller. ACLs now always return the proper type. BOOL
is already returned when the result is expected to be processed as a boolean.
temp_pattern has been unexported now.
The new sample types are necessary for the acl-pattern convergence.
These types are boolean and signed int. Some types were renamed for
less ambiguity (ip->ipv4, integer->uint).
A large number of ACLs make use of frontend, backend or table names in their
arguments, and fall back to the current proxy when no argument is passed. If
the expected capability is not available, the ACL silently fails at runtime.
Now we make all those names mandatory in the parser and we rely on
acl_find_targets() to replace the missing names with the holding proxy,
then to perform the appropriate tests, and to reject errors at parsing
time.
It is possible that some faulty configurations will get rejected from now
on, while they used to silently fail till now. This is the reason why this
change is marked as MAJOR.
Proxy names are now resolved when the config is parsed and not at runtime.
This means that errors will be caught for real instead of having an ACL
silently never match. Another benefit is that the fetch will be much faster
since the lookup will not have to be performed anymore, eg for all ACLs
based on explicitly named stick-tables.
However some buggy configurations which used to silently fail in the past
will now refuse to load, hence the MAJOR tag.
The types and minimal number of ACL keyword arguments are now stored in
their declaration. This will allow many more fantasies if some ACL use
several arguments or types.
Doing so required to rework all ACL keyword declarations to add two
parameters. So this was a good opportunity for a general cleanup and
to sort all entries in alphabetical order.
We still have two pending issues :
- parse_acl_expr() checks for errors but has no way to report them to
the user ;
- the types of some arguments are still not resolved and kept as strings
(eg: ARGT_FE/BE/TAB) for compatibility reasons, which must be resolved
in acl_find_targets()
The ACL parser now uses the argument parser to build a typed argument list.
Right now arguments are all strings and only one argument is supported since
this is what ACLs currently support.
msg->sol is now a relative pointer just like all other ones. There is no
more absolute references to the buffer outside the struct buffer itself.
Next two cleanups should include removing buffer references to functions
which already have an msg, and removal of wrapping detection in request
and response parsing which cannot wrap by definition.
These offsets were relative to the buffer itself. Now they're relative to
the buffer's origin (buf->p) which normally corresponds to the start of
current message.
This saves a big dependency between the HTTP message struct and the buffers.
It appeared during this change that ->col is not used anymore (it will have
to be removed). Next step is to turn ->eol and ->sol from absolute to relative.
We don't have buf->l anymore. We have buf->i for pending data and
the total length is retrieved by adding buf->o. Some computation
already become simpler.
Despite extreme care, bugs are not excluded.
It's worth noting that msg->err_pos as set by HTTP request/response
analysers becomes relative to pending data and not to the beginning
of the buffer. This has not been completed yet so differences might
occur when outgoing data are left in the buffer.
These callbacks are used to retrieve the source and destination address
of a socket. The address flags are not hold on the stream interface and
not on the session anymore. The addresses are collected when needed.
This still needs to be improved to store the IP and port separately so
that it is not needed to perform a getsockname() when only the IP address
is desired for outgoing traffic.
The hash of IPv6 addresses was not properly aligned and resulted in the
last quarter of the address not being hashed. In practice, this is rarely
detected since MAC addresses are used in the second half. But this becomes
very visible with IPv6-mapped IPv4 addresses such as ::FFFF:1.2.3.4 where
the IPv4 part is never hashed.
This bug has been there forever, since introduction of "balance source" in
v1.2.11. The fix must then be backported to all stable versions.
Thanks to Alex Markham for reporting this issue to the list !
%Bi return the backend source IP
%Bp return the backend source port
Add a function pointer in logformat_type to do additional configuration
during the log-format variable parsing.
The principle behind this load balancing algorithm was first imagined
and modeled by Steen Larsen then iteratively refined through several
work sessions until it would totally address its original goal.
The purpose of this algorithm is to always use the smallest number of
servers so that extra servers can be powered off during non-intensive
hours. Additional tools may be used to do that work, possibly by
locally monitoring the servers' activity.
The first server with available connection slots receives the connection.
The servers are choosen from the lowest numeric identifier to the highest
(see server parameter "id"), which defaults to the server's position in
the farm. Once a server reaches its maxconn value, the next server is used.
It does not make sense to use this algorithm without setting maxconn. Note
that it can however make sense to use minconn so that servers are not used
at full load before starting new servers, and so that introduction of new
servers requires a progressively increasing load (the number of servers
would more or less follow the square root of the load until maxconn is
reached). This algorithm ignores the server weight, and is more beneficial
to long sessions such as RDP or IMAP than HTTP, though it can be useful
there too.
The new function does not return IP addresses but header values instead,
so that the caller is free to make what it want of them. The conversion
is not quite clean yet, as the previous test which considered that address
0.0.0.0 meant "no address" is still used. A different IP parsing function
should be used to take this into account.
Now strings and data blocks are stored in the temp_pattern's chunk
and matched against this one.
The rdp_cookie currently makes extensive use of acl_fetch_rdp_cookie()
and will be a good candidate for the initial rework so that ACLs use
the patterns framework and not the other way around.
All ACL fetches which return integer value now store the result into
the temporary pattern struct. All ACL matches which rely on integer
also get their value there.
Note: the pattern data types are not set right now.
This is 1.5-specific. It causes issues with transparent source binding involving
hdr_ip. We must not try to bind() to a foreign address when the family is not set,
and we must set the family when an address is set.
Stream interfaces used to distinguish between client and server addresses
because they were previously of different types (sockaddr_storage for the
client, sockaddr_in for the server). This is not the case anymore, and this
distinction is confusing at best and has caused a number of regressions to
be introduced in the process of converting everything to full-ipv6. We can
now remove this and have a much cleaner code.
Nick Chalk reported that a connection to a server which has no port specified
used twice the port number. The reason is that the port number was taken from
the wrong part of the address, the client's destination address was used as the
base port instead of the server's configured address.
Thanks to Nick for his helpful diagnostic.
A similar issue as the previous one causes port mapping to fail in some
combinations of client and server address families. Using the macros fixes
the issue.
Adding health checks has become a real pain, with cross-references to all
checks everywhere because they're all a single bit. Since they're all
exclusive, let's change this to have a check number only. We reserve 4
bits allowing up to 16 checks (15+tcp), only 7 of which are currently
used. The code has shrunk by almost 1kB and we saved a few option bits.
The "dispatch" option has been moved to px->options, making a few tests
a bit cleaner.
Since we now have the copy of the target in the session, use it instead
of relying on the SI for it. The SI drops the target upon unregister()
so applets such as stats were logged as "NOSRV".
This option enables use of the PROXY protocol with the server, which
allows haproxy to transport original client's address across multiple
architecture layers.
It's very annoying that frontend and backend stats are merged because we
don't know what we're observing. For instance, if a "listen" instance
makes use of a distinct backend, it's impossible to know what the bytes_out
means.
Some points take care of not updating counters twice if the backend points
to the frontend, indicating a "listen" instance. The thing becomes more
complex when we try to add support for server side keep-alive, because we
have to maintain a pointer to the backend used for last request, and to
update its stats. But we can't perform such comparisons anymore because
the counters will not match anymore.
So in order to get rid of this situation, let's have both frontend AND
backend stats in the "struct proxy". We simply update the relevant ones
during activity. Some of them are only accounted for in the backend,
while others are just for frontend. Maybe we can improve a bit on that
later, but the essential part is that those counters now reflect what
they really mean.
This patch turns internal server addresses to sockaddr_storage to
store IPv6 addresses, and makes the connect() function use it. This
code already works but some caveats with getaddrinfo/gethostbyname
still need to be sorted out while the changes had to be merged at
this stage of internal architecture changes. So for now the config
parser will not emit an IPv6 address yet so that user experience
remains unchanged.
This change should have absolutely zero user-visible effect, otherwise
it's a bug introduced during the merge, that should be reported ASAP.
This one has been removed and is now totally superseded by ->target.
To get the server, one must use target_srv(&s->target) instead of
s->srv now.
The function ensures that non-server targets still return NULL.
s->prev_srv is used by assign_server() only, but all code paths leading
to it now take s->prev_srv from the existing s->srv. So assign_server()
can do that copy into its own stack.
If at one point a different srv is needed, we still have a copy of the
last server on which we failed a connection attempt in s->target.
When dealing with HTTP keep-alive, we'll have to know if we can reuse
an existing connection. For that, we'll have to check if the current
connection was made on the exact same target (referenced in the stream
interface).
Thus, we need to first assign the next target to the session, then
copy it to the stream interface upon connect(). Later we'll check for
equivalence between those two operations.
Till now we used the fact that the dispatch address was not null to use
the dispatch mode. This is very unconvenient, so let's have a dedicated
option.
When doing a connect() on a stream interface, some information is needed
from the server and from the backend. In some situations, we don't have
a server and only a backend (eg: peers). In other cases, we know we have
an applet and we don't want to connect to anything, but we'd still like
to have the info about the applet being used.
For this, we now store a pointer to the "target" into the stream interface.
The target describes what's on the other side before trying to connect. It
can be a server, a proxy or an applet for now. Later we'll probably have
descriptors for multiple-stage chains so that the final information may
still be found.
This will help removing many specific cases in the code. It already made
it possible to remove the "srv" and "be" parameters to tcpv4_connect_server().
Bryan Talbot reported that POST requests with a query string were not
correctly processed if the hash parameter was the first one, because
the delimiter that was looked for to trigger the parsing was '&' instead
of '?'.
Also, while checking the code, it became apparent that it was enough for
a query string to be present in the request for POST parameters to be
ignored, even if the url_param was in the body and not in the URL.
The code has then been fixed like this :
1) look for URL param. If found, return it.
2) if no URL param was found and method is POST, then look it up into
the body
The code now seems to pass all request combinations.
This patch must be backported to 1.4 since 1.4 is equally broken right now.
Till now, the forwarding code was making use of the hdr_content_len member
to hold the size of the last chunk parsed. As such, it was reset after being
scheduled for forwarding. The issue is that this entry was reset before the
data could be viewed by backend.c in order to parse a POST body, so the
"balance url_param check_post" did not work anymore.
In order to fix this, we need two things :
- the chunk size (reset upon every forward)
- the total body size (not reset)
hdr_content_len was thus replaced by the former (hence the size of the patch)
as it makes more sense to have it stored that way than the way around.
This patch should be backported to 1.4 with care, considering that it affects
the forwarding code.
When the number of servers is a multiple of the size of the input set,
map-based hash can be inefficient. This typically happens with 64
servers when doing URI hashing. The "avalanche" hash-type applies an
avalanche hash before performing a map lookup in order to smooth the
distribution. The result is slightly less smooth than the map for small
numbers of servers, but still better than the consistent hashing.
Till now when a server was configured with address 0.0.0.0, the
connection was forwarded to this address which generally is intercepted
by the system as a local address, so this was completely useless.
One sometimes useful feature for outgoing transparent proxies is to
be able to forward the connection to the same address the client
requested. This patch fixes the meaning of 0.0.0.0 precisely to
ensure that the connection will be forwarded to the initial client's
destination address.
It's not normal to initialize the server-side stream interface from the
accept() function, because it may change later. Thus, we introduce a new
stream_sock_prepare_interface() function which is called just before the
connect() and which sets all of the stream_interface's callbacks to the
default ones used for real sockets. The ->connect function is also set
at the same instant so that we can easily add new server-side protocols
soon.
The 'client.c' file now only contained frontend-specific functions,
so it has naturally be renamed 'frontend.c'. Same for client.h. This
has also been an opportunity to remove some cross references from
files that should not have depended on it.
In the end, this file should contain a protocol-agnostic accept()
code, which would initialize a session, task, etc... based on an
accept() from a lower layer. Right now there are still references
to TCP.
Some ACLs in the client ought to belong to proto_tcp, or protocols.
This file should only contain frontend-specific information and will
be renamed that way in next commit.
Some functions which act on generic buffer contents without being
tcp-specific were historically in proto_tcp.c. This concerns ACLs
and RDP cookies. Those have been moved away to more appropriate
locations. Ideally we should create some new files for each layer6
protocol parser. Let's do that later.
This ACL was missing in complex setups where the status of a remote site
has to be considered in switching decisions. Until there, using a server's
status in an ACL required to have a dedicated backend, which is a bit heavy
when multiple servers have to be monitored.
Using get_ip_from_hdr2() we can look for occurrence #X or #-X and
extract the IP it contains. This is typically designed for use with
the X-Forwarded-For header.
Using "usesrc hdr_ip(name,occ)", it becomes possible to use the IP address
found in <name>, and possibly specify occurrence number <occ>, as the
source to connect to a server. This is possible both in a server and in
a backend's source statement. This is typically used to use the source
IP previously set by a upstream proxy.
The transparent proxy address selection was set in the TCP connect function
which is not the most appropriate place since this function has limited
access to the amount of parameters which could produce a source address.
Instead, now we determine the source address in backend.c:connect_server(),
right after calling assign_server_address() and we assign this address in
the session and pass it to the TCP connect function. This cannot be performed
in assign_server_address() itself because in some cases (transparent mode,
dispatch mode or http_proxy mode), we assign the address somewhere else.
This change will open the ability to bind to addresses extracted from many
other criteria (eg: from a header).
Isidore Li reported an occasional segfault when using URL hashing, and
kindly provided backtraces and core files to help debugging.
The problem was triggered by reset connections before the URL was sent,
and was due to the same bug which was fixed by commit e45997661b
(connections were attempted in case of connection abort). While that
bug was already fixed, it appeared that the same segfault could be
triggered when URL hashing is configured in an HTTP backend when the
frontend runs in TCP mode and no URL was seen. It is totally abnormal
to try to hash a null URL, as well as to process any kind of L7 hashing
when a full request was not seen.
This additional fix now ensures that layer7 hashing is not performed on
incomplete requests.
This is used to force access to down servers for some requests. This
is useful when validating that a change on a server correctly works
before enabling the server again.
Some message pointers were not usable once the message reached the
HTTP_MSG_DONE state. This is the case for ->som which points to the
body because it is needed to parse chunks. There is one case where
we need the beginning of the message : server redirect. We have to
call http_get_path() after the request has been parsed. So we rely
on ->sol without counting on ->som. In order to achieve this, we're
making ->rq.{u,v} relative to the beginning of the message instead
of the buffer. That simplifies the code and makes it cleaner.
Preliminary tests show this is OK.
Supported informations, available via "tr/td title":
- cap: capabilities (proxy)
- mode: one of tcp, http or health (proxy)
- id: SNMP ID (proxy, socket, server)
- IP (socket, server)
- cookie (backend, server)
The rq.u field is relative to buf->data, not to msg->sol. We have
to subtract msg->som everywhere this error was made. Maybe it will
be simpler to have a pointer to the buffer in the message and find
appropriate data there.
When parsing body for URL parameters, we must not consider that
data are available from buf->data but from buf->data + msg->som.
This is not a problem right now but may become with keep-alive.
Now that the HTTP analyser will already have parsed the beginning
of the request body, we don't have to check for transfer-encoding
anymore since we have the current chunk size in hdr_content_len.
These ACLs are used to check the number of active connections on the
frontend, backend or in a backend's queue. The avg_queue returns the
average number of queued connections per server, and for this, divides
the total number of queued connections by the number of alive servers.
The dst_conn ACL has been slightly changed to more reflect its name and
original usage, which is to return the number of connections on the
destination address/port (the socket) and not the whole frontend.
Consistent hashing provides some interesting advantages over common
hashing. It avoids full redistribution in case of a server failure,
or when expanding the farm. This has a cost however, the hashing is
far from being perfect, as we associate a server to a request by
searching the server with the closest key in a tree. Since servers
appear multiple times based on their weights, it is recommended to
use weights larger than approximately 10-20 in order to smoothen
the distribution a bit.
In some cases, playing with weights will be the only solution to
make a server appear more often and increase chances of being picked,
so stats are very important with consistent hashing.
In order to indicate the type of hashing, use :
hash-type map-based (default, old one)
hash-type consistent (new one)
Consistent hashing can make sense in a cache farm, in order not
to redistribute everyone when a cache changes state. It could also
probably be used for long sessions such as terminal sessions, though
that has not be attempted yet.
More details on this method of hashing here :
http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
There are a few remaining max values that need to move to counters.
Also, the counters are more often used than some config information,
so get them closer to the other useful struct members for better cache
efficiency.
The "static-rr" is just the old round-robin algorithm. It is still
in use when a hash algorithm is used and the data to hash is not
present, but it was impossible to configure it explicitly. This one
is cheaper in terms of CPU and supports unlimited numbers of servers,
so it makes sense to be able to use it.
LB algo macros were composed of the LB algo by itself without any indication
of the method to use to look up a server (the lb function itself). This
method was implied by the LB algo, which was not very convenient to add
more algorithms. Now we have several fields in the LB macros, some to
describe what to look for in the requests, some to describe how to transform
that (kind of algo) and some to describe what lookup function to use.
The next patch will make it possible to factor out some code for all algos
which rely on a map.
We need to remove hash map accesses out of backend.c if we want to
later support new hash methods. This patch separates the hash computation
method from the server lookup. It leaves the lookup function to lb_map.c
and calls it with the result of the hash.
It was becoming painful to have all the LB algos in backend.c.
Let's move them to their own files. A few hashing functions still
need be broken in two parts, one for the contents and one for the
map position.
There is no reason to inline functions which are used to grab a server
depending on an LB algo. They are large and used at several places.
Uninlining them saves 400 bytes of code.
Please consider the following patches. They are required to
compile haproxy-1.4-dev2 on FreeBSD.
Summary:
1) include <sys/types.h> before <netinet/tcp.h>
2) Use IPPROTO_TCP instead of SOL_TCP
(they are both defined as 6, TCP protocol number)
The connection establishment was completely handled by backend.c which
normally just handles LB algos. Since it's purely TCP, it must move to
proto_tcp.c. Also, instead of calling it directly, we now call it via
the stream interface, which will later help us unify session handling.
Andrew Azarov reported that haproxy-1.4-dev1 does not build
under FreeBSD 7.2 because SOL_TCP is not defined. So add a
check for its definition before using it. This only impacts
network optimisations anyway.
This Linux-specific option was never really used in production and
has since been superseded by new splicing options brought by recent
Linux kernels.
It caused several particular cases in the code because the kernel
would take care of the session without haproxy being able to do
anything on it, which became hard to handle in the new architecture.
Let's simply get rid of it now that there is a replacement available.
This patch adds support for hashing RDP cookies in order to
use them as a load-balancing key. The new "rdp-cookie(name)"
load-balancing metric has to be used for this. It is still
mandatory to wait for an RDP cookie in the frontend, otherwise
it will randomly work.
The splice code did not consider compatibility between both ends
of the connection. Now we set different capabilities on each
stream interface, depending on what the protocol can splice to/from.
Right now, only TCP is supported. Thanks to this, we're now able to
automatically detect when splice() is not implemented and automatically
disable it on one end instead of reporting errors to the upper layer.
This new option enables combining of request buffer data with
the initial ACK of an outgoing TCP connection. Doing so saves
one packet per connection which is quite noticeable on workloads
mostly consisting in small objects. The option is not enabled by
default.
Setting TCP_CORK on a socket before sending the last segment enables
automatic merging of this segment with the FIN from the shutdown()
call. Playing with TCP_CORK is not easy though as we have to track
the status of the TCP_NODELAY flag since both are mutually exclusive.
Doing so saves one more packet per session and offers about 5% more
performance.
There is no reason not to do it, so there is no associated option.
Some users are already hitting the 64k source port limit when
connecting to servers. The system usually maintains a list of
unused source ports, regardless of the source IP they're bound
to. So in order to go beyond the 64k concurrent connections, we
have to manage the source ip:port lists ourselves.
The solution consists in assigning a source port range to each
server and use a free port in that range when connecting to that
server, either for a proxied connection or for a health check.
The port must then be put back into the server's range when the
connection is closed.
This mechanism is used only when a port range is specified on
a server. It makes it possible to reach 64k connections per
server, possibly all from the same IP address. Right now it
should be more than enough even for huge deployments.
There is a patch made by me that allow for balancing on any http header
field.
[WT:
made minor changes:
- turned 'balance header name' into 'balance hdr(name)' to match more
closely the ACL syntax for easier future convergence
- renamed the proxy structure fields header_* => hh_*
- made it possible to use the domain name reduction to any header, not
only "host" since it makes sense to do it with other ones.
Otherwise patch looks good.
/WT]
The connect timeout was not properly detected due to the fact that
it was not correctly initialized. It must be set as the stream interface
timeout, not the buffer's write timeout.
These new ACLs match frontend session rate and backend session rate.
Examples are provided in the doc to explain how to use that in order
to limit abuse of service.
With this change, all frontends, backends, and servers maintain a session
counter and a timer to compute a session rate over the last second. This
value will be very useful because it varies instantly and can be used to
check thresholds. This value is also reported in the stats in a new "rate"
column.
Specifying "interface <name>" after the "source" statement allows
one to bind to a specific interface for proxy<->server traffic.
This makes it possible to use multiple links to reach multiple
servers, and to force traffic to pass via an interface different
from the one the system would have chosen based on the routing
table.
Setting "nosplice" in the global section will disable the use of TCP
splicing (both tcpsplice and linux 2.6 splice). The same will be
achieved using the "-dS" parameter on the command line.
In the buffers, the read limit used to leave some place for header
rewriting was set by a pointer to the end of the buffer. Not only
this required subtracts at every place in the code, but this will
also soon not be usable anymore when we want to support keepalive.
Let's replace this with a length limit, comparable to the buffer's
length. This has also sightly reduced the code size.
"option transparent" was set and checked on frontends only while it
is purely a backend thing as it replaces the "balance" mode. For this
reason, it did only work in "listen" sections. This change will then
not affect the rare users of this option.
I'm in the process of setting up one haproxy instance now, and I find
the following acl option useful. I'm not too sure why this option has
not been available before, but I find this useful for my own usage, so
I'm submitting this patch in the hope that it will be useful as well.
The basic idea is to be able to measure the available connection slots
still available (connection, + queue) - anything beyond that can be
redirected to a different backend. 'connslots' = number of available
server connection slots, + number of available server queue slots. In
the case where we encounter srv maxconn = 0, or srv maxqueue = 0 (in
which case we dont need to care about connslots) the value you get is
-1. Note also that this code does not take care of dynamic connections
at this point in time.
The reason why I'm using this new acl (as opposed to 'nbsrv') is that
'nbsrv' only measures servers that are actually *down*. Whereas this
other acl is more fine-grained, and looks into the number of conn
slots available as well.
The new function looks like the previous one except that it operates
at the stream interface level and assumes an already closed SI.
Also remove some old unused occurrences of srv_close_with_err().
It is quite hard to track when the current session has already been counted
or discounted from the server's total number of established sessions. For
this reason, we introduce a new session flag, SN_CURR_SESS, which indicates
if the current session is one of those reported by the server or not. It
simplifies session accounting and makes it far more robust. It also makes
it possible to perform a last-minute cleanup during session_free().
Right now, with this fix and a few more buffer transitions fixes, no session
were found to remain after a test.
The connection setup code has been refactored in order to
make it run only on low level (stream interface). Several
complicated functions have been removed from backend.c,
and we now have sess_update_stream_int() to manage
an assigned connection, sess_prepare_conn_req() to assign a
server to a connection request, perform_http_redirect() to
redirect instead of connecting to server, and return_srv_error()
to return connection error status messages.
The stream_interface status changes are checked before adjusting
buffer flags, so that the buffers can be informed about this lower
level update.
A new connection is initiated by changing si->state from SI_ST_INI
to SI_ST_REQ.
The code seems to work but is awfully dirty. Some functions need
to be moved, and the layering is not yet quite clear.
A lot of dead old code has simply been removed.
It was not practical to have QUEUE and TAR timers in buffers, as they caused
triggering of the timeout flags. Move them to the stream interface where they
belong.
As of now, a stream socket does not directly wake up the task
but it does contact the stream interface which itself knows the
task. This allows us to perform a few cleanups upon errors and
shutdowns, which reduces the number of calls to data_update()
from 8 per session to 2 per session, and make all the functions
called in the process_session() loop completely swappable.
Some improvements are required. We need to provide a shutw()
function on stream interfaces so that one side which closes
its read part on an empty buffer can propagate the close to
the remote side.
srv_state has been removed from HTTP state machines, and states
have been split in either TCP states or analyzers. For instance,
the TARPIT state has just become a simple analyzer.
New flags have been added to the struct buffer to compensate this.
The high-level stream processors sometimes need to force a disconnection
without touching a file-descriptor (eg: report an error). But if
they touched BF_SHUTW or BF_SHUTR, the file descriptor would not
be closed. Thus, the two SHUT?_NOW flags have been added so that
an application can request a forced close which the stream interface
will be forced to obey.
During this change, a new BF_HIJACK flag was added. It will
be used for data generation, eg during a stats dump. It
prevents the producer on a buffer from sending data into it.
BF_SHUTR_NOW /* the producer must shut down for reads ASAP */
BF_SHUTW_NOW /* the consumer must shut down for writes ASAP */
BF_HIJACK /* the producer is temporarily replaced */
BF_SHUTW_NOW has precedence over BF_HIJACK. BF_HIJACK has
precedence over BF_MAY_FORWARD (so that it does not need it).
New functions buffer_shutr_now(), buffer_shutw_now(), buffer_abort()
are provided to manipulate BF_SHUT* flags.
A new type "stream_interface" has been added to describe both
sides of a buffer. A stream interface has states and error
reporting. The session now has two stream interfaces (one per
side). Each buffer has stream_interface pointers to both
consumer and producer sides.
The server-side file descriptor has moved to its stream interface,
so that even the buffer has access to it.
process_srv() has been split into three parts :
- tcp_get_connection() obtains a connection to the server
- tcp_connection_failed() tests if a previously attempted
connection has succeeded or not.
- process_srv_data() only manages the data phase, and in
this sense should be roughly equivalent to process_cli.
Little code has been removed, and a lot of old code has been
left in comments for now.
In backend.c, we had an EV_FD_SET() called before fd_insert().
This is wrong because fd_insert updates maxfd which might be
used by some of the pollers during EV_FD_SET(), although this
is not currently the case.
It's a shame not to use buffer->wex for connection timeouts since by
definition it cannot be used till the connection is not established.
Using it instead of ->cex also makes the buffer processing more
symmetric.
The SV_STANALYZE state was installed on the server side but was really
meant to be processed with the rest of the request on the client side.
It suffered from several issues, mostly related to the way timeouts were
handled while waiting for data.
All known issues related to timeouts during a request - and specifically
a request involving body processing - have been raised and fixed. At this
point, the code is a bit dirty but works fine, so next steps might be
cleanups with an ability to come back to the current state in case of
trouble.
If an HTTP/0.9-like POST request is sent to haproxy while
configured with url_param + check_post, it will crash. The
reason is that the total buffer length was computed based
on req->total (which equals the number of bytes read) and
not req->l (number of bytes in the buffer), thus leading
to wrong size calculations when calling memchr().
The affected code does not look like it could have been
exploited to run arbitrary code, only reads were performed
at wrong locations.
All currently known ACL verbs have been assigned a type which makes
it possible to detect inconsistencies, such as response values used
in request rules.
It should be stated as a rule that a C file should never
include types/xxx.h when proto/xxx.h exists, as it gives
less exposure to declaration conflicts (one of which was
caught and fixed here) and it complicates the file headers
for nothing.
Only types/global.h, types/capture.h and types/polling.h
have been found to be valid includes from C files.
This is the first attempt at moving all internal parts from
using struct timeval to integer ticks. Those provides simpler
and faster code due to simplified operations, and this change
also saved about 64 bytes per session.
A new header file has been added : include/common/ticks.h.
It is possible that some functions should finally not be inlined
because they're used quite a lot (eg: tick_first, tick_add_ifset
and tick_is_expired). More measurements are required in order to
decide whether this is interesting or not.
Some function and variable names are still subject to change for
a better overall logics.
The dequeuing logic was completely wrong. First, a task was assigned
to all servers to process the queue, but this task was never scheduled
and was only woken up on session free. Second, there was no reservation
of server entries when a task was assigned a server. This means that
as long as the task was not connected to the server, its presence was
not accounted for. This was causing trouble when detecting whether or
not a server had reached maxconn. Third, during a redispatch, a session
could lose its place at the server's and get blocked because another
session at the same moment would have stolen the entry. Fourth, the
redispatch option did not work when maxqueue was reached for a server,
and it was not possible to do so without indefinitely hanging a session.
The root cause of all those problems was the lack of pre-reservation of
connections at the server's, and the lack of tracking of servers during
a redispatch. Everything relied on combinations of flags which could
appear similarly in quite distinct situations.
This patch is a major rework but there was no other solution, as the
internal logic was deeply flawed. The resulting code is cleaner, more
understandable, uses less magics and is overall more robust.
As an added bonus, "option redispatch" now works when maxqueue has
been reached on a server.
This patch adds two optional arguments "len" and "depth" to
"balance uri". They are used to limit the length in characters
of the analysis, as well as the number of directory components
it applies to.
This patch extends the "url_param" load balancing method by introducing
the "check_post" option. Using this option enables analysis of the beginning
of POST requests to search for the specified URL parameter.
The patch also fixes a few minor typos in comments that were discovered
during code review.
The new "leastconn" LB algorithm selects the server which has the
least established or pending connections. The weights are considered,
so that a server with a weight of 20 will get twice as many connections
as the server with a weight of 10.
The algorithm respects the minconn/maxconn settings, as well as the
slowstart since it is a dynamic algorithm. It also correctly supports
backup servers (one and all).
It is generally suited for protocols with long sessions (such as remote
terminals and databases), as it will ensure that upon restart, a server
with no connection will take all new ones until its load is balanced
with others.
A test configuration has been added in order to ease regression testing.
Commit 3168223a7b broke option
"allbackups" in roundrobin mode due to an erroneous structure
member replacement in backend.c. The PR_O_USE_ALL_BK flag was
not tested in the right member anymore.
This bug uncoverred another one, by which all backup servers would
be used whatever the option's value, if all of them had been seen
as simultaneously failed at one moment.
This patch fixes the two stupid errors. Correctness has been tested
using the test-fwrr.cfg config example.
When haproxy decides that session needs to be redispatched it chose a server,
but there is no guarantee for it to be a different one. So, it often
happens that selected server is exactly the same that it was previously, so
a client ends up with a 503 error anyway, especially when one sever has
much bigger weight than others.
Changes from the previous version:
- drop stupid and unnecessary SN_DIRECT changes
- assign_server(): use srvtoavoid to keep the old server and clear s->srv
so SRV_STATUS_NOSRV guarantees that t->srv == NULL (again)
and get_server_rr_with_conns has chances to work (previously
we were passing a NULL here)
- srv_redispatch_connect(): remove t->srv->cum_sess and t->srv->failed_conns
incrementing as t->srv was guaranteed to be NULL
- add avoididx to get_server_rr_with_conns. I hope I correctly understand this code.
- fix http_flush_cookie_flags() and move it to assign_server_and_queue()
directly. The code here was supposed to set CK_DOWN and clear CK_VALID,
but: (TX_CK_VALID | TX_CK_DOWN) == TX_CK_VALID == TX_CK_MASK so:
if ((txn->flags & TX_CK_MASK) == TX_CK_VALID)
txn->flags ^= (TX_CK_VALID | TX_CK_DOWN);
was really a:
if ((txn->flags & TX_CK_MASK) == TX_CK_VALID)
txn->flags &= TX_CK_VALID
Now haproxy logs "--DI" after redispatching connection.
- defer srv->redispatches++ and s->be->redispatches++ so there
are called only if a conenction was redispatched, not only
supposed to.
- don't increment lbconn if redispatcher selected the same sarver
- don't count unsuccessfully redispatched connections as redispatched
connections
- don't count redispatched connections as errors, so:
- the number of connections effectively served by a server is:
srv->cum_sess - srv->failed_conns - srv->retries - srv->redispatches
and
SUM(servers->failed_conns) == be->failed_conns
- requires the "Don't increment server connections too much + fix retries" patch
- needs little more testing and probably some discussion so reverting to the RFC state
Tests #1:
retries 4
redispatch
i) 1 server(s): b (wght=1, down)
b) sessions=5, lbtot=1, err_conn=1, retr=4, redis=0
-> request failed
ii) server(s): b (wght=1, down), u (wght=1, down)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=1, retr=0, redis=0
-> request FAILED
iii) 2 server(s): b (wght=1, down), u (wght=1, up)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=0, retr=0, redis=0
-> request OK
iv) 2 server(s): b (wght=100, down), u (wght=1, up)
b) sessions=4, lbtot=1, err_conn=0, retr=3, redis=1
u) sessions=1, lbtot=1, err_conn=0, retr=0, redis=0
-> request OK
v) 1 server(s): b (down for first 4 SYNS)
b) sessions=5, lbtot=1, err_conn=0, retr=4, redis=0
-> request OK
Tests #2:
retries 4
i) 1 server(s): b (down)
b) sessions=5, lbtot=1, err_conn=1, retr=4, redis=0
-> request FAILED
Commit 98937b8757 while fixing
one bug introduced another one. With "retries 4" and
"option redispatch" haproxy tries to connect 4 times to
one server server and 1 time to a second one. However
logs showed 5 connections to the first server (the
last one was counted twice) and 2 to the second.
This patch also fixes srv->retries and be->retries increments.
Now I get: 3 retries and 1 error in a first server (4 cum_sess)
and 1 error in a second server (1 cum_sess) with:
retries 4
option redispatch
and: 4 retries and 1 error (5 cum_sess) with:
retries 4
So, the number of connections effectively served by a server is:
srv->cum_sess - srv->failed_conns - srv->retries
GCC4 is stupid (unbelievable news!).
When some code uses __builtin_expect(x != 0, 1), it really performs
the check of x != 0 then tests that the result is not zero! This is
a double check when only one was expected. Some performance drops
of 10% in the HTTP parser code have been observed due to this bug.
GCC 3.4 is fine though.
A solution consists in expecting that the tested value is 1. In
this case, it emits the correct code, but it's still not optimal
it seems. Finally the best solution is to ignore likely() and to
pray for the compiler to emit correct code. However, we still have
to fix unlikely() to remove the test there too, and to fix all
code which passed pointers overthere to pass integers instead.
Now when a server has "redir <prefix>" on its config line, any HEAD or GET
request addressing it will lead to a 302 with Location set to "<prefix>"
immediately followed by the relative URI of the incoming request. This makes
it very easy to send redirect to browsers to check remote static servers, as
well as to provide redirection for remote sites when the local one is down.
There's no point trying to check original dest addr with only one
method when doing transparent proxy as in full transparent mode,
the real destination address is required. Let's copy the one from
the frontend.
The source address selection for health checks did not consider
the new transparent proxy method. Rely on the same unified function
as the other connect() calls.
This patch also fixes a bug by which the proxy's source address was
ignored if cttproxy was used.
Balabit's TPROXY version 4 which replaces CTTPROXY provides a similar
API to the previous proxy, but relies on IP_FREEBIND instead of
IP_TRANSPARENT. Let's add it.
Using some Linux kernel patches which add the IP_TRANSPARENT
SOL_IP option , it is possible to bind to a non-local address
on without having resort to any sort of NAT, thus causing no
performance degradation.
This is by far faster and cleaner than the previous CTTPROXY
method. The code has been slightly changed in order to remain
compatible with CTTPROXY as a fallback for the new method when
it does not work.
It is not needed anymore to specify the outgoing source address
for connect, it can remain 0.0.0.0.
This patch extends a little previously added functionality to also
count retries and redispatches for servers. Now it is possible to know
which server causes redispatches as it is not always the same that takes
most retries.
While working with the code I found that redistribute_pending() does not increment
srv->redispatches && be->redispatches. I don't know how to test it but
I think the fix is correct. If not I can withdraw it.
I also extended logs to show how many retries were done and if redispatching
was necessary ('+'). I'm using an additional session flag SN_REDISP to match
redispatched connections. I had to rearrange all defines in session.h to make
more room for it.
The documentation about logs was also fixed a little (sorry, english only),
as current version uses totally different format. BTW: examples are still
outdated, maybe next time...
Finally, I changed %d -> %u for retries/redispatches as those variables
are declared as unsigned.
It was abnormal to see more connect errors than connect attempts.
This was caused by the fact that the server's connection count was
not incremented for failed connect() attempts.
Now the per-server connections are correctly incremented for each
connect() attempt. This includes the retries too. The number of
connections effectively served by a server will then be :
srv->cum_sess - srv->errors - srv->warnings
One user reported that an indicator was missing in the statistics:
the number of times each server was selected by load balancing. It
is in fact the total number of sessions assigned to a server by the
load balancing algorithm. It should directly reflect the weight for
"fair" algorithms such as round-robin, since it will not account for
persistant connections.
It should help a lot tuning each server's weight depending on the
load it receives.
Now the connect timeout, tarpit timeout and queue timeout are
distinct. In order to retain compatibility with older versions,
if either queue or tarpit is left unset both in the proxy and
in the default proxy, then it is inherited from the connect
timeout as before.
Now we can compute the max place depending on the number of servers,
maximum weight and weight scale. The formula has been stored as a
comment so that it's easy to choose between smooth weight ramp up
and high number of servers. The default scale has been set to 16,
which permits 4000 servers with a granularity of 6% in the worst
case (weight=1).
The new "nbsrv" ACL verb matches the number of active servers in a backend.
By default, it applies to the backend where it is declared, but optionally
it can receive the name of another backend as an argument in parenthesis.
It counts the number of enabled active servers first, then the number of
enabled backup servers.
It's not always obvious for the callers of set_server_status_{up,down}
whether the new state really is up or down. Some flags as well as the
effective weight have to be considered. Let's ensure that those functions
perform the necessary check themselves so that if the state transition
cannot be performed, at least everything is updated as required.
When an HTTP server returns "404 not found", it indicates that at least
part of it is still running. For this reason, it can be convenient for
application administrators to be able to consider code 404 as valid,
but for a server which does not want to participate to load balancing
anymore. This is useful to seamlessly exclude a server from a farm
without acting on the load balancer. For instance, let's consider that
haproxy checks for the "/alive" file. To enable load balancing on a
server, the admin would simply do :
# touch /var/www/alive
And to disable the server, he would simply do :
# rm /var/www/alive
Another immediate gain from doing this is that it is now possible to
send NOTICE messages instead of ALERT messages when a server is first
disable, then goes down. This provides a graceful shutdown method.
To enable this behaviour, specify "http-check disable-on-404" in the
backend.
Hello,
You will find attached an updated release of previously submitted patch.
It polish some part and extend ACL engine to match IP and PORT parsed in
HTTP request. (and take care of comments made by Willy ! ;))
Best regards,
Alexandre
The number of possible options for a proxy has already reached
32, which is the current limit due to the fact that they are
each represented as a bit in a 32-bit word.
It's possible to move the load balancing algorithms to another
place. It will also save some space for future algorithms.
This round robin algorithm was written from trees, so that we
do not have to recompute any table when changing server weights.
This solution allows on-the-fly weight adjustments with immediate
effect on the load distribution.
There is still a limitation due to 32-bit computations, to about
2000 servers at full scale (weight 255), or more servers with
lower weights. Basically, sum(srv.weight)*4096 must be below 2^31.
Test configurations and an example program used to develop the
tree will be added next.
Many changes have been brought to the weights computations and
variables in order to accomodate for the possiblity of a server to
be running but disabled from load balancing due to a null weight.
Under some circumstances, it will be useful to be able to have
a server's effective weight bigger than the user weight, and this
is particularly true for dynamic weight-based algorithms. In order
to support this, we add a "wdiv" member to the lbprm structure
which will always be used to divide the weights before reporting
them.