These functions will be needed by "show sess" on the CLI, let's make them
globally available. It's important to note that due to the fact that we
still do not set the data and transport layers' names in the structures,
we still have to rely on some exports just to match the pointers. This is
ugly but is preferable to adding many includes since the short-term goal
is to get rid of these tests by having proper names in place.
Setting an FD to -1 when closed isn't the most easily noticeable thing
to do when we're chasing accidental reuse of a stale file descriptor.
Instead set it to that large a negative value that it will overflow the
fdtab and provide an analysable core at the moment the issue happens.
Care was taken to ensure it doesn't overflow nor change sign on 32-bit
machines when multiplied by fdtab, and that it also remains negative for
the various checks that exist. The value equals 0xFDDEADFD which happens
to be easily spotted in a debugger.
The bug described in commit 568743a ("BUG/MEDIUM: stream-int: completely
detach connection on connect error") was not a stream-interface layer bug
but a connection layer bug. There was exactly one place in the code where
we could change a file descriptor's status without first checking whether
it is valid or not, it was in conn_stop_polling(). This one is called when
the polling status is changed after an update, and calls fd_stop_both even
if we had already closed the file descriptor :
1479388298.484240 ->->->->-> conn_fd_handler > conn_cond_update_polling
1479388298.484240 ->->->->->-> conn_cond_update_polling > conn_stop_polling
1479388298.484241 ->->->->->->-> conn_stop_polling > conn_ctrl_ready
1479388298.484241 conn_stop_polling < conn_ctrl_ready
1479388298.484241 ->->->->->->-> conn_stop_polling > fd_stop_both
1479388298.484242 ->->->->->->->-> fd_stop_both > fd_update_cache
1479388298.484242 ->->->->->->->->-> fd_update_cache > fd_release_cache_entry
1479388298.484242 fd_update_cache < fd_release_cache_entry
1479388298.484243 fd_stop_both < fd_update_cache
1479388298.484243 conn_stop_polling < fd_stop_both
1479388298.484243 conn_cond_update_polling < conn_stop_polling
1479388298.484243 conn_fd_handler < conn_cond_update_polling
The problem with the previous fix above is that it break the http_proxy mode
and possibly even some Lua parts and peers to a certain extent ; all outgoing
connections where the target address is initially copied into the outgoing
connection which experience a retry would use a random outgoing address after
the retry because closing and detaching the connection causes the target
address to be lost. This was attempted to be addressed by commit 0857d7a
("BUG/MAJOR: stream: properly mark the server address as unset on connect
retry") but it used to only solve the most visible effect and not the root
cause.
Prior to this fix, it was possible to cause this config to keep CLOSE_WAIT
for as long as it takes to expire a client or server timeout (note the
missing client timeout) :
listen test
mode http
bind :8002
server s1 127.0.0.1:8001
$ tcploop 8001 L0 W N20 A R P100 S:"HTTP/1.1 200 OK\r\nContent-length: 0\r\n\r\n" &
$ tcploop 8002 N200 C T W S:"GET / HTTP/1.0\r\n\r\n" O P10000 K
With this patch, these CLOSE_WAIT properly vanish when both processes leave.
This commit reverts the two fixes above and replaces them with the proper
fix in connection.h. It must be backported to 1.6 and 1.5. Thanks to
Robson Roberto Souza Peixoto for providing very detailed traces showing
some obvious inconsistencies leading to finding this bug.
This pointer will be used for storing private context. With this,
the same executed function can handle more than one keyword. This
will be very useful for creation Lua cli bindings.
The release function is called when the command is terminated (give
back the hand to the prompt) or when the session is broken (timeout
or client closed).
A new "option spop-check" statement has been added to enable server health
checks based on SPOP HELLO handshake. SPOP is the protocol used by SPOE filters
to talk to servers.
It does the opposite of 'set-var' action/converter. It is really useful for
per-process variables. But, it can be used for any scope.
The lua function 'unset_var' has also been added.
This function, unsurprisingly, sets a variable value only if it already
exists. In other words, this function will succeed only if the variable was
found somewhere in the configuration during HAProxy startup.
It will be used by SPOE filter. So an agent will be able to set a value only for
existing variables. This prevents an agent to create a very large number of
unused variables to flood HAProxy and exhaust the memory reserved to variables..
This code has been moved from haproxy.c to sample.c and the function
release_sample_expr can now be called from anywhere to release a sample
expression. This function will be used by the stream processing offload engine
(SPOE).
This new setting supports a comma-delimited list of methods used to
resolve the server's FQDN to an IP address. Currently supported methods
are "libc" (use the regular libc's resolver) and "last" (use the last
known valid address found in the state file).
The list is implemented in a 32-bit integer, because each init-addr
method only requires 3 bits. The last one must always be SRV_IADDR_END
(0), allowing to store up to 10 methods in a single 32 bit integer.
Note: the doc is provided at the end of this series.
It will be important to help debugging some DNS resolution issues to
know why a server was marked down, so let's make the function support
a 3rd argument with an indication of the reason. Passing NULL will keep
the message as-is.
Server addresses are not resolved anymore upon the first pass so that we
don't fail if an address cannot be resolved by the libc. Instead they are
processed all at once after the configuration is fully loaded, by the new
function srv_init_addr(). This function only acts on the server's address
if this address uses an FQDN, which appears in server->hostname.
For now the function does two things, to followup with HAProxy's historical
default behavior:
1. apply server IP address found in server-state file if runtime DNS
resolution is enabled for this server
2. use the DNS resolver provided by the libc
If none of the 2 options above can find an IP address, then an error is
returned.
All of this will be needed to support the new server parameter "init-addr".
For now, the biggest user-visible change is that all server resolution errors
are dumped at once instead of causing a startup failure one by one.
In the last release a lot of the structures have become opaque for an
end user. This means the code using these needs to be changed to use the
proper functions to interact with these structures instead of trying to
manipulate them directly.
This does not fix any deprecations yet that are part of 1.1.0, it only
ensures that it can be compiled against that version and is still
compatible with older ones.
[wt: openssl-0.9.8 doesn't build with it, there are conflicts on certain
function prototypes which we declare as inline here and which are
defined differently there. But openssl-0.9.8 is not supported anymore
so probably it's OK to go without it for now and we'll see later if
some users still need it. Emeric has reviewed this change and didn't
spot anything obvious which requires special care. Let's try it for
real now]
The only reason wurfl/wurfl.h was needed outside of wurfl.c was to expose
wurfl_handle which is a pointer to a structure, referenced by global.h.
By just storing a void* there instead, we can confine all wurfl code to
wurfl.c, which is really nice.
Right now there is an issue with the way the maintenance flags are
propagated upon startup. They are not propagate, just copied from the
tracked server. This implies that depending on the server's order, some
tracking servers may not be marked down. For example this configuration
does not work as expected :
server s1 1.1.1.1:8000 track s2
server s2 1.1.1.1:8000 track s3
server s3 1.1.1.1:8000 track s4
server s4 wtap:8000 check inter 1s disabled
It results in s1/s2 being up, and s3/s4 being down, while all of them
should be down.
The only clean way to process this is to run through all "root" servers
(those not tracking any other server), and to propagate their state down
to all their trackers. This is the same algorithm used to propagate the
state changes. It has to be done both to compute the IDRAIN flag and the
IMAINT flag. However, doing so requires that tracking servers are not
marked as inherited maintenance anymore while parsing the configuration
(and given that it is wrong, better drop it).
This fix also addresses another side effect of the bug above which is
that the IDRAIN/IMAINT flags are stored in the state files, and if
restored while the tracked server doesn't have the equivalent flag,
the servers may end up in a situation where it's impossible to remove
these flags. For example in the configuration above, after removing
"disabled" on server s4, the other servers would have remained down,
and not anymore with this fix. Similarly, the combination of IMAINT
or IDRAIN with their respective forced modes was not accepted on
reload, which is wrong as well.
This bug has been present at least since 1.5, maybe even 1.4 (it came
with tracking support). The fix needs to be backported there, though
the srv-state parts are irrelevant.
This commit relies on previous patch to silence warnings on startup.
We used to have 7 different character classes, each was 256 bytes long,
resulting in almost 2kB being used in the L1 cache. It's as cheap to
test a bit than to check the byte is not null, so let's store a 7-bit
composite value and check for the respective bits there instead.
The executable is now 4 kB smaller and the performance on small
objects increased by about 1% to 222k requests/second with a config
involving 4 http-request rules including 1 header lookup, one header
replacement, and 2 variable assignments.
This commit introduces "tcp-request session" rules. These are very
much like "tcp-request connection" rules except that they're processed
after the handshake, so it is possible to consider SSL information and
addresses rewritten by the proxy protocol header in actions. This is
particularly useful to track proxied sources as this was not possible
before, given that tcp-request content rules are processed after each
HTTP request. Similarly it is possible to assign the proxied source
address or the client's cert to a variable.
This is in order to make integration of tcp-request-session cleaner :
- tcp_exec_req_rules() was renamed tcp_exec_l4_rules()
- LI_O_TCP_RULES was renamed LI_O_TCP_L4_RULES
(LI_O_*'s horrible indent was also fixed and a provision was left
for L5 rules).
These are denied conns. Strangely this wasn't emitted while it used to be
available for a while. It corresponds to the number of connections blocked
by "tcp-request connection reject".
To register a new cli keyword, you need to declare a cli_kw_list
structure in your source file:
static struct cli_kw_list cli_kws = {{ },{
{ { "test", "list", NULL }, "test list : do some tests on the cli", test_parsing, NULL },
{ { NULL }, NULL, NULL, NULL, NULL }
}};
And then register it:
cli_register_kw(&cli_kws);
The first field is an array of 5 elements, where you declare the
keywords combination which will match, it must be ended by a NULL
element.
The second field is used as a usage message, it will appear in the help
of the cli, you can set it to NULL if you don't want to show it, it's a
good idea if you want to overwrite some existing keywords.
The two last fields are callbacks.
The first one is used at parsing time, you can use it to parse the
arguments of your keywords and print small messages. The function must
return 1 in case of a failure, otherwise 0:
#include <proto/dumpstats.h>
static int test_parsing(char **args, struct appctx *appctx)
{
struct chunk out;
if (!*args[2]) {
appctx->ctx.cli.msg = "Error: the 3rd argument is mandatory !";
appctx->st0 = STAT_CLI_PRINT;
return 1;
}
chunk_reset(&trash);
chunk_printf(&trash, "arg[3]: %s\n", args[2]);
chunk_init(&out, NULL, 0);
chunk_dup(&out, &trash);
appctx->ctx.cli.err = out.str;
appctx->st0 = STAT_CLI_PRINT_FREE; /* print and free in the default cli_io_handler */
return 0;
}
The last field is the IO handler callback, it can be set to NULL if you
want to use the default cli_io_handler() otherwise you can write your
own. You can use the private pointer in the appctx if you need to store
a context or some data. stats_dump_sess_to_buffer() is a good example of
IO handler, IO handlers often use the appctx->st2 variable for the state
machine. The handler must return 0 in case it have to be recall later
otherwise 1.
During the stick-table teaching process which occurs at reloading/restart time,
expiration dates of stick-tables entries were not synchronized between peers.
This patch adds two new stick-table messages to provide such a synchronization feature.
As these new messages are not supported by older haproxy peers protocol versions,
this patch increments peers protol version, from 2.0 to 2.1, to help in detecting/supporting
such older peers protocol implementations so that new versions might still be able
to transparently communicate with a newer one.
[wt: technically speaking it would be nice to have this backported into 1.6
as some people who reload often are affected by this design limitation, but
it's not a totally transparent change that may make certain users feel
reluctant to upgrade older versions. Let's let it cook in 1.7 first and
decide later]
New DNS response parser function which turn the DNS response from a
network buffer into a DNS structure, much easier for later analysis
by upper layer.
Memory is pre-allocated at start-up in a chunk dedicated to DNS
response store.
New error code to report a wrong number of queries in a DNS response.
This function can replace update_server_addr() where the need to change the
server's port as well as the IP address is required.
It performs some validation before performing each type of change.
HAProxy used to deduce port used for health checks when parsing configuration
at startup time.
Because of this way of working, it makes it complicated to change the port at
run time.
The current patch changes this behavior and makes HAProxy to choose the
port used for health checking when preparing the check task itself.
A new type of error is introduced and reported when no port can be found.
There won't be any impact on performance, since the process to find out the
port value is made of a few 'if' statements.
This patch also introduces a new check state CHK_ST_PORT_MISS: this flag is
used to report an error in the case when HAProxy needs to establish a TCP
connection to a server, to perform a health check but no TCP ports can be
found for it.
And last, it also introduces a new stream termination condition:
SF_ERR_CHK_PORT. Purpose of this flag is to report an error in the event when
HAProxy has to run a health check but no port can be found to perform it.
At some places, smp_dup() is inappropriately called to ensure a modification
is possible while in fact we only need to ensure the sample may be modified
in place. Let's provide smp_is_rw() to check for this capability and
smp_make_rw() to perform the smp_dup() when it is not the case.
Note that smp_is_rw() will also try to add the trailing zero on strings when
needed if possible, to avoid a useless duplication.
These functions ensure that the designated sample is "safe for use",
which means that its size is known, its length is correct regarding its
size, and that strings are properly zero-terminated.
smp_is_safe() only checks (and optionally sets the trailing zero when
needed and possible). smp_make_safe() will call smp_dup() after
smp_is_safe() fails.
This enables tracking of sticky counters from current response. The only
difference from "http-request track-sc" is the <key> sample expression
can only make use of samples in response (eg. res.*, status etc.) and
samples below Layer 6.
When NetScaler application switch is used as L3+ switch, informations
regarding the original IP and TCP headers are lost as a new TCP
connection is created between the NetScaler and the backend server.
NetScaler provides a feature to insert in the TCP data the original data
that can then be consumed by the backend server.
Specifications and documentations from NetScaler:
https://support.citrix.com/article/CTX205670https://www.citrix.com/blogs/2016/04/25/how-to-enable-client-ip-in-tcpip-option-of-netscaler/
When CIP is enabled on the NetScaler, then a TCP packet is inserted just after
the TCP handshake. This is composed as:
- CIP magic number : 4 bytes
Both sender and receiver have to agree on a magic number so that
they both handle the incoming data as a NetScaler Client IP insertion
packet.
- Header length : 4 bytes
Defines the length on the remaining data.
- IP header : >= 20 bytes if IPv4, 40 bytes if IPv6
Contains the header of the last IP packet sent by the client during TCP
handshake.
- TCP header : >= 20 bytes
Contains the header of the last TCP packet sent by the client during TCP
handshake.
When compiled with GCC 6, the IP address specified for a frontend was
ignored and HAProxy was listening on all addresses instead. This is
caused by an incomplete copy of a "struct sockaddr_storage".
With the GNU Libc, "struct sockaddr_storage" is defined as this:
struct sockaddr_storage
{
sa_family_t ss_family;
unsigned long int __ss_align;
char __ss_padding[(128 - (2 * sizeof (unsigned long int)))];
};
Doing an aggregate copy (ss1 = ss2) is different than using memcpy():
only members of the aggregate have to be copied. Notably, padding can be
or not be copied. In GCC 6, some optimizations use this fact and if a
"struct sockaddr_storage" contains a "struct sockaddr_in", the port and
the address are part of the padding (between sa_family and __ss_align)
and can be not copied over.
Therefore, we replace any aggregate copy by a memcpy(). There is another
place using the same pattern. We also fix a function receiving a "struct
sockaddr_storage" by copy instead of by reference. Since it only needs a
read-only copy, the function is converted to request a reference.
'channel_analyze' callback has been removed. Now, there are 2 callbacks to
surround calls to analyzers:
* channel_pre_analyze: Called BEFORE all filterable analyzers. it can be
called many times for the same analyzer, once at each loop until the
analyzer finishes its processing. This callback is resumable, it returns a
negative value if an error occurs, 0 if it needs to wait, any other value
otherwise.
* channel_post_analyze: Called AFTER all filterable analyzers. Here, AFTER
means when an analyzer finishes its processing. This callback is NOT
resumable, it returns a negative value if an error occurs, any other value
otherwise.
Pre and post analyzer callbacks are not automatically called. 'pre_analyzers'
and 'post_analyzers' bit fields in the filter structure must be set to the right
value using AN_* flags (see include/types/channel.h).
The flag AN_RES_ALL has been added (AN_REQ_ALL already exists) to ease the life
of filter developers. AN_REQ_ALL and AN_RES_ALL include all filterable
analyzers.
Now, to call an analyzer in 'process_stream' function, we should use
FLT_ANALAYZE or ANALYZE macros, depending if this is a filterable analyzer or
not.
In 1.4-dev3, commit 31971e5 ("[MEDIUM] add support for infinite forwarding")
made it possible to configure the lower layer to forward data indefinitely
by setting the forward size to CHN_INFINITE_FORWARD (4GB-1). By then larger
chunk sizes were not supported so there was no confusion in the usage of the
function.
Since 1.5 we support 64-bit content-lengths and chunk sizes and the function
has grown to support 64-bit arguments, though it still limits a single pass
to 32-bit quantities (what fit in the channel's to_forward field). The issue
now becomes that a 4GB-1 content-length can be confused with infinite
forwarding (in fact it's 4GB-1+what was already in the buffer). It causes a
visible effect when transferring this exact size because the transfer rate
is lower than with other sizes due in part to the disabling of the Nagle
algorithm on the sendto() call.
In theory with keep-alive it should prevent a second request from being
processed after such a transfer, but since the analysers are still present,
the forwarding analyser properly counts down the remaining size to transfer
and ultimately the transaction gets correctly reset so there is no visible
effect.
Since the root cause of the issue is an API problem (lack of distinction
between a real valid length and a magic value), this patch modifies the API
to have a new dedicated function called channel_forward_forever() to program
a permanent forwarding. The existing function __channel_forward() was modified
to properly take care of the requested sizes and ensure it 1) never overflows
and 2) never reaches CHN_INFINITE_FORWARD by accident.
It is worth noting that the function used to have a bug causing a 2GB
forward to be scheduled if it was called with less data than what is present
in buf->i. Fortunately this bug couldn't be triggered with existing code.
This fix should be backported to 1.6 and 1.5. While it also theorically
affects 1.4, it's better not to backport it there, as the risk of breaking
large object transfers due to significant API differences is high, compared
to the fact that the largest supported objects (4GB-1) are just slower to
transfer.
Unfortunately, commit 169c470 ("BUG/MEDIUM: channel: fix miscalculation of
available buffer space (3rd try)") was still not enough to completely
address the issue. It fell into an integer comparison trap. Contrary to
expectations, chn->to_forward may also have the sign bit set when
forwarding regular data having a large content-length, resulting in
an incomplete check of the result and of the reserve because the with
to_forward very large, to_forward+o could become very small and also
the reserve could become positive again and make channel_recv_limit()
return a negative value.
One way to reproduce this situation is to transfer a large file (> 2GB)
with http-keep-alive or http-server-close, without splicing, and ensure
that the server uses content-length instead of chunks. The transfer
should stall very early after the first buffer has been transferred
to the client.
This fix now properly checks 1) for an overflow caused by summing o and
to_forward, and 2) for o+to_forward being smaller or larger than maxrw
before performing the subtract, so that all sensitive operations are
properly performed on 33-bit arithmetics.
The code was subjected again to a series of tests using inject+httpterm
scanning a wide range of object sizes (+10MB after each new request) :
$ printf "new page 1\nget 127.0.0.1:8002 / s=%%s0m\n" | \
inject64 -o 1 -u 1 -f /dev/stdin
With previous fix, the transfer would suddenly stop when reaching 2GB :
hits ^hits hits/s ^h/s bytes kB/s last errs tout htime sdht ptime
203 1 2 1 216816173354 2710202 3144892 0 0 685.0 0.0 685.0
205 2 2 2 219257283186 2706880 2441109 0 0 679.5 6.5 679.5
205 0 2 0 219257283186 2673836 0 0 0 0.0 0.0 0.0
205 0 2 0 219257283186 2641622 0 0 0 0.0 0.0 0.0
205 0 2 0 219257283186 2610174 0 0 0 0.0 0.0 0.0
Now it's fine even past 4 GB.
Many thanks to Vedran Furac for reporting this issue early with a common
access pattern helping to troubleshoot this.
This fix must be backported to 1.6 and 1.5 where the commit above was
already backported.
This function returns non-zero if the channel is congested with data in
transit waiting for leaving, indicating to the caller that it should wait
for the reserve to be released before starting to process new data in
case it needs the ability to append data. This is meant to be used while
waiting for a clean response buffer before processing a request.
This is very useful in complex architecture systems where HAproxy
is balancing DB connections for example. We want to keep the maxconn
high in order to avoid issues with queueing on the LB level when
there is slowness on another part of the system. Example is a case of
an architecture where each thread opens multiple DB connections, which
if get stuck in queue cause a snowball effect (old connections aren't
closed, new ones cannot be established). These connections are mostly
idle and the DB server has no problem handling thousands of them.
Allowing us to dynamically set maxconn depending on the backend usage
(LA, CPU, memory, etc.) enables us to have high maxconn for situations
like above, but lowering it in case there are real issues where the
backend servers become overloaded (cache issues, DB gets hit hard).
Latest fix 8a32106 ("BUG/MEDIUM: channel: fix miscalculation of available
buffer space (2nd try)") did happen to fix some observable issues but not
all of them in fact, some corner cases still remained and at least one user
reported a busy loop that appeared possible, though not easily reproducible
under experimental conditions.
The remaining issue is that we still consider min(i, to_fwd) as the number
of bytes in transit, but in fact <i> is not relevant here. Indeed, what
matters is that we can read everything we want at once provided that at
the end, <i> cannot be larger than <size-maxrw> (if it was not already).
This is visible in two cases :
- let's have i=o=max/2 and to_fwd=0. Then i+o >= max indicates that the
buffer is already full, while it is not since once <o> is forwarded,
some space remains.
- when to_fwd is much larger than i, it's obvious that we can fill the
buffer.
The only relevant part in fact is o + to_fwd. to_fwd will ensure that at
least this many bytes will be moved from <i> to <o> hence will leave the
buffer, whatever the number of rounds it takes.
Interestingly, the fix applied here ensures that channel_recv_max() will
now equal (size - maxrw - i + to_fwd), which is indeed what remains
available below maxrw after to_fwd bytes are forwarded from i to o and
leave the buffer.
Additionally, the latest fix made it possible to meet an integer overflow
that was not caught by the range test when forwarding in TCP or tunnel
mode due to to_forward being added to an existing value, causing the
buffer size to be limited when it should not have been, resulting in 2
to 3 recv() calls when a single one was enough. The first one was limited
to the unreserved buffer size, the second one to the size of the reserve
minus 1, and the last one to the last byte. Eg with a 2kB buffer :
recvfrom(22, "HTTP/1.1 200\r\nConnection: close\r"..., 1024, 0, NULL, NULL) = 1024
recvfrom(22, "23456789.123456789.123456789.123"..., 1023, 0, NULL, NULL) = 1023
recvfrom(22, "5", 1, 0, NULL, NULL) = 1
This bug is still present in 1.6 and 1.5 so the fix should be backported
there.
The condition to poll for receive as implemented in channel_may_recv()
is still incorrect. If buf->o is null and buf->i is slightly larger than
chn->to_forward and at least as large as buf->size - maxrewrite, then
reading will be disabled. It may slightly delay some data delivery by
having first to forward pending bytes, but may also cause some random
issues with analysers that wait for some data before starting to forward
what they correctly parsed. For instance, a body analyser may be prevented
from seeing the data that only fits in the reserve.
This bug may also prevent an applet's chk_rcv() function from being called
when part of a buffer is released. It is possible (though not verified)
that this participated to some peers frozen session issues some people
have been facing.
This fix should be backported to 1.6 and 1.5 to ensure better coherency
with channel_recv_limit().
Commit 9c06ee4 ("BUG/MEDIUM: channel: don't schedule data in transit for
leaving until connected") took care of an issue involving POST in conjunction
with http-send-name-header, where we absolutely never want to touch the
reserve until we're sure not to touch the buffer contents anymore, which
is indicated by the output stream-interface being connected.
But channel_may_recv() was not equipped with such a test, so in some
situations it might decide that it is possible to poll for reads, and
later channel_recv_limit() will decide it's not possible to read,
causing a loop. So we must add a similar test there.
Since the fix above was backported to 1.6 and 1.5, this fix must as well.
There's quite some inconsistency in the internal API. listener_accept()
which is the main accept() function returns void but is declared as int
in the include file. It's assigned to proto->accept() for all stream
protocols where an int is expected but the result is never checked (nor
is it documented by the way). This proto->accept() is in turn assigned
to fd->iocb() which is supposed to return an int composed of FD_WAIT_*
flags, but which is never checked either.
So let's fix all this mess :
- nobody checks accept()'s return
- nobody checks iocb()'s return
- nobody sets a return value
=> let's mark all these functions void and keep the current ones intact.
Additionally we now include listener.h from listener.c to ensure we won't
silently hide this incoherency in the future.
Note that this patch could/should be backported to 1.6 and even 1.5 to
simplify debugging sessions.
Commit 999f643 ("BUG/MEDIUM: channel: fix miscalculation of available buffer
space.") introduced a bug which made output data to be ignored when computing
the remaining room in a buffer. The problem is that channel_may_recv()
properly considers them and may declare that the FD may be polled for read
events, but once the even strikes, channel_recv_limit() called before recv()
says the opposite. In 1.6 and later this case is automatically caught by
polling loop detection at the connection level and is harmless. But the
backport in 1.5 ends up with a busy polling loop as soon as it becomes
possible to have a buffer with this conflict. In order to reproduce it, it
is necessary to have less than [maxrewrite] bytes available in a buffer, no
forwarding enabled (end of transfer) and [buf->o >= maxrewrite - free space].
Since this heavily depends on socket buffers, it will randomly strike users.
On 1.5 with 8kB buffers it was possible to reproduce it with httpterm using
the following command line :
$ (printf "GET /?s=675000 HTTP/1.0\r\n\r\n"; sleep 60) | \
nc6 --rcvbuf-size 1 --send-only 127.0.0.1 8002
This bug is only medium in 1.6 and later but is major in the 1.5 backport,
so it must be backported there.
Thanks to Nenad Merdanovic and Janusz Dziemidowicz for reporting this issue
with enough elements to help understand it.
This patch splits the function stats_dump_be_stats() in two parts. The
part is called stats_fill_be_stats(), and just fill the stats buffer.
This split allows the usage of preformated stats in other parts of HAProxy
like the Lua.
This patch splits the function stats_dump_sv_stats() in two parts. The
extracted part is called stats_fill_sv_stats(), and just fill the stats buffer.
This split allows the usage of preformated stats in other parts of HAProxy
like the Lua.
This patch splits the function stats_dump_li_stats() in two parts. The
extracted part is called stats_fill_li_stats(), and just fill the stats buffer.
This split allows the usage of preformated stats in other parts of HAProxy
like the Lua.
This patch splits the function stats_dump_fe_stats() in two parts. The
extracted part is called stats_fill_fe_stats(), and just fill the stats buffer.
This split allows the usage of preformated stats in other parts of HAProxy
like the Lua.
This patch splits the function stats_dump_info_to_buffer() in two parts. The
extracted part is called stats_fill_info(), and just fill the stats buffer.
This split allows the usage of preformated stats in other parts of HAProxy
like the Lua.
This patch adds a Lua post initialisation wrapper. It already exists for
pure Lua function, now it executes also C. It is useful for doing things
when the configuration is ready to use. For example we can can browse and
register all the proxies.
All the HAProxy Lua object are declared with the same pattern:
- Add the function __tosting which dumps the object name
- Register the name in the Lua REGISTRY
- Register the reference ID
These action are refactored in on function. This remove some
lines of code.
The functions
- hlua_class_const_int()
- hlua_class_const_str()
- hlua_class_function()
are use for common class registration actions.
The function 'hlua_dump_object()' is generic dump name function.
These functions can be used by all the HAProxy objects, so I move
it into the safe functions file.
This emits the field positions, names and types. It is more convenient
than the default output for a parser that doesn't know all the fields. It
simply relies on stats_emit_typed_data_field() and stats_emit_field_tags()
added by previous patch for the output. A new stats format flag was added,
STAT_FMT_TYPED, which is set when the "typed" keyword is specified on the
CLI.
New function stats_emit_typed_data_field() does exactly like
stats_emit_raw_data_field() except that it also prints the data
type after a colon. This will be used to print using the typed
format.
And function stats_emit_field_tags() appends a 3-letter code
describing the origin, nature, and scope, followed by an optional
delimiter. This will be particularly convenient to dump typed
data.
We're preparing for various data types for each stats field as they
appear in the CSV output. For now we only cover the regular types handled
by printf, so we have 32 and 64 bit ints and counters, strings, and of
course "empty" to indicate that there's nothing in the field and which
guarantees that any accessed entry will return 0.
More types will surely come later so that some fields are properly
represented. For example, we could see limits where only the value 0
doesn't show up, or human time, etc.
This is the continuation of previous patch called "BUG/MAJOR: samples:
check smp->strm before using it".
It happens that variables may have a session-wide scope, and that their
session is retrieved by dereferencing the stream. But nothing prevents them
from being used from a streamless context such as tcp-request connection,
thus crashing the process. Example :
tcp-request connection accept if { src,set-var(sess.foo) -m found }
In order to fix this, we have to always ensure that variable manipulation
only happens via the sample, which contains the correct owner and context,
and that we never use one from a different source. This results in quite a
large change since a lot of functions are inderctly involved in the call
chain, but the change is easy to follow.
This fix must be backported to 1.6, and requires the last two patches.
Since commit 6879ad3 ("MEDIUM: sample: fill the struct sample with the
session, proxy and stream pointers") merged in 1.6-dev2, the sample
contains the pointer to the stream and sample fetch functions as well
as converters use it heavily. This requires from a lot of call places
to initialize 4 fields, and it was even forgotten at a few places.
This patch provides a convenient helper to initialize all these fields
at once, making it easy to prepare a new sample from a previous one for
example.
A few call places were cleaned up to make use of it. It will be needed
by further fixes.
At one place in the Lua code, it was moved earlier because we used to
call sample casts with a non completely initialized sample, which is
not clean eventhough at the moment there are no consequences.
the function server_parse_addr_change_request() contain an hardcoded
updater source "stats command". this function can be called from other
sources than the "stats command", so this patch make this argument
generic.
The commit 87b096 renames the functions srv_shutdown_backup_sessions()
and srv_shutdown_sessions() to srv_shutdown_backup_streams() and
srv_shutdown_streams().
The header file <proto/servers.h> does not repport these changes.
This bug should be repported in the 1.6 branch, even if it is useless
because new dev are frozen.
DNS selection preferences are actually declared inline in the
struct server. There are copied from the server struct to the
dns_resolution struct for each resolution.
Next patchs adds new preferences options, and it is not a good
way to copy all the configuration information before each dns
resolution.
This patch extract the configuration preference from the struct
server and declares a new dedicated struct. Only a pointer to this
new striuict will be copied before each dns resolution.
The +E mode escapes characters '"', '\' and ']' with '\' as prefix. It
mostly makes sense to use it in the RFC5424 structured-data log formats.
Example:
log-format-sd %{+Q,+E}o\ [exampleSDID@1234\ header=%[capture.req.hdr(0)]]
This patch moves the function hlua_checkudata which check that
an object contains the expected class_reference as metatable.
This function is commonly used by all the lua functions.
The function hlua_metatype is also moved.
When Lua executes functions from its API, these can throws an error.
These function must be executed in a special environment which catch
these error, otherwise a critical error (like segfault) can raise.
This patch add a c file called "hlua_fcn.c" which collect all the
Lua/c function needing safe environment for its execution.
Now, filter's configuration (.id, .conf and .ops fields) is stored in the
structure 'flt_conf'. So proxies own a flt_conf list instead of a filter
list. When a filter is attached to a stream, it gets a pointer on its
configuration. This avoids mixing the filter's context (owns by a stream) and
its configuration (owns by a proxy). It also saves 2 pointers per filter
instance.
Before, functions to filter HTTP body (and TCP data) were called from the moment
at least one filter was attached to the stream. If no filter is interested by
these data, this uselessly slows data parsing.
A good example is the HTTP compression filter. Depending of request and response
headers, the response compression can be enabled or not. So it could be really
nice to call it only when enabled.
So, now, to filter HTTP/TCP data, a filter must use the function
register_data_filter. For TCP streams, this function can be called only
once. But for HTTP streams, when needed, it must be called for each HTTP request
or HTTP response.
Only registered filters will be called during data parsing. At any time, a
filter can be unregistered by calling the function unregister_data_filter.
From the stream point of view, this new structure is opaque. it hides filters
implementation details. So, impact for future optimizations will be reduced
(well, we hope so...).
Some small improvements has been made in filters.c to avoid useless checks.
This new analyzer will be called for each HTTP request/response, before the
parsing of the body. It is identified by AN_FLT_HTTP_HDRS.
Special care was taken about the following condition :
* the frontend is a TCP proxy
* filters are defined in the frontend section
* the selected backend is a HTTP proxy
So, this patch explicitly add AN_FLT_HTTP_HDRS analyzer on the request and the
response channels when the backend is a HTTP proxy and when there are filters
attatched on the stream.
This patch simplifies http_request_forward_body and http_response_forward_body
functions.
For Chunked HTTP request/response, the body filtering can be really
expensive. In the worse case (many chunks of 1 bytes), the filters overhead is
of 3 calls per chunk. If http_data callback is useful, others are just
informative.
So these callbacks has been removed. Of course, existing filters (trace and
compression) has beeen updated accordingly. For the HTTP compression filter, the
update is quite huge. Its implementation is closer to the old one.
When no filter is attached to the stream, the CPU footprint due to the calls to
filters_* functions is huge, especially for chunk-encoded messages. Using macros
to check if we have some filters or not is a great improvement.
Furthermore, instead of checking the filter list emptiness, we introduce a flag
to know if filters are attached or not to a stream.
HTTP compression has been rewritten to use the filter API. This is more a PoC
than other thing for now. It allocates memory to work. So, if only for that, it
should be rewritten.
In the mean time, the implementation has been refactored to allow its use with
other filters. However, there are limitations that should be respected:
- No filter placed after the compression one is allowed to change input data
(in 'http_data' callback).
- No filter placed before the compression one is allowed to change forwarded
data (in 'http_forward_data' callback).
For now, these limitations are informal, so you should be careful when you use
several filters.
About the configuration, 'compression' keywords are still supported and must be
used to configure the HTTP compression behavior. In absence of a 'filter' line
for the compression filter, it is added in the filter chain when the first
compression' line is parsed. This is an easy way to do when you do not use other
filters. But another filter exists, an error is reported so that the user must
explicitly declare the filter.
For example:
listen tst
...
compression algo gzip
compression offload
...
filter flt_1
filter compression
filter flt_2
...
HTTP compression will be moved in a true filter. To prepare the ground, some
functions have been moved in a dedicated file. Idea is to keep everything about
compression algos in compression.c and everything related to the filtering in
flt_http_comp.c.
For now, a header has been added to help during the transition. It will be
removed later.
Unused empty ACL keyword list was removed. The "compression" keyword
parser was moved from cfgparse.c to flt_http_comp.c.
This patch adds the support of filters in HAProxy. The main idea is to have a
way to "easely" extend HAProxy by adding some "modules", called filters, that
will be able to change HAProxy behavior in a programmatic way.
To do so, many entry points has been added in code to let filters to hook up to
different steps of the processing. A filter must define a flt_ops sutrctures
(see include/types/filters.h for details). This structure contains all available
callbacks that a filter can define:
struct flt_ops {
/*
* Callbacks to manage the filter lifecycle
*/
int (*init) (struct proxy *p);
void (*deinit)(struct proxy *p);
int (*check) (struct proxy *p);
/*
* Stream callbacks
*/
void (*stream_start) (struct stream *s);
void (*stream_accept) (struct stream *s);
void (*session_establish)(struct stream *s);
void (*stream_stop) (struct stream *s);
/*
* HTTP callbacks
*/
int (*http_start) (struct stream *s, struct http_msg *msg);
int (*http_start_body) (struct stream *s, struct http_msg *msg);
int (*http_start_chunk) (struct stream *s, struct http_msg *msg);
int (*http_data) (struct stream *s, struct http_msg *msg);
int (*http_last_chunk) (struct stream *s, struct http_msg *msg);
int (*http_end_chunk) (struct stream *s, struct http_msg *msg);
int (*http_chunk_trailers)(struct stream *s, struct http_msg *msg);
int (*http_end_body) (struct stream *s, struct http_msg *msg);
void (*http_end) (struct stream *s, struct http_msg *msg);
void (*http_reset) (struct stream *s, struct http_msg *msg);
int (*http_pre_process) (struct stream *s, struct http_msg *msg);
int (*http_post_process) (struct stream *s, struct http_msg *msg);
void (*http_reply) (struct stream *s, short status,
const struct chunk *msg);
};
To declare and use a filter, in the configuration, the "filter" keyword must be
used in a listener/frontend section:
frontend test
...
filter <FILTER-NAME> [OPTIONS...]
The filter referenced by the <FILTER-NAME> must declare a configuration parser
on its own name to fill flt_ops and filter_conf field in the proxy's
structure. An exemple will be provided later to make it perfectly clear.
For now, filters cannot be used in backend section. But this is only a matter of
time. Documentation will also be added later. This is the first commit of a long
list about filters.
It is possible to have several filters on the same listener/frontend. These
filters are stored in an array of at most MAX_FILTERS elements (define in
include/types/filters.h). Again, this will be replaced later by a list of
filters.
The filter API has been highly refactored. Main changes are:
* Now, HA supports an infinite number of filters per proxy. To do so, filters
are stored in list.
* Because filters are stored in list, filters state has been moved from the
channel structure to the filter structure. This is cleaner because there is no
more info about filters in channel structure.
* It is possible to defined filters on backends only. For such filters,
stream_start/stream_stop callbacks are not called. Of course, it is possible
to mix frontend and backend filters.
* Now, TCP streams are also filtered. All callbacks without the 'http_' prefix
are called for all kind of streams. In addition, 2 new callbacks were added to
filter data exchanged through a TCP stream:
- tcp_data: it is called when new data are available or when old unprocessed
data are still waiting.
- tcp_forward_data: it is called when some data can be consumed.
* New callbacks attached to channel were added:
- channel_start_analyze: it is called when a filter is ready to process data
exchanged through a channel. 2 new analyzers (a frontend and a backend)
are attached to channels to call this callback. For a frontend filter, it
is called before any other analyzer. For a backend filter, it is called
when a backend is attached to a stream. So some processing cannot be
filtered in that case.
- channel_analyze: it is called before each analyzer attached to a channel,
expects analyzers responsible for data sending.
- channel_end_analyze: it is called when all other analyzers have finished
their processing. A new analyzers is attached to channels to call this
callback. For a TCP stream, this is always the last one called. For a HTTP
one, the callback is called when a request/response ends, so it is called
one time for each request/response.
* 'session_established' callback has been removed. Everything that is done in
this callback can be handled by 'channel_start_analyze' on the response
channel.
* 'http_pre_process' and 'http_post_process' callbacks have been replaced by
'channel_analyze'.
* 'http_start' callback has been replaced by 'http_headers'. This new one is
called just before headers sending and parsing of the body.
* 'http_end' callback has been replaced by 'channel_end_analyze'.
* It is possible to set a forwarder for TCP channels. It was already possible to
do it for HTTP ones.
* Forwarders can partially consumed forwardable data. For this reason a new
HTTP message state was added before HTTP_MSG_DONE : HTTP_MSG_ENDING.
Now all filters can define corresponding callbacks (http_forward_data
and tcp_forward_data). Each filter owns 2 offsets relative to buf->p, next and
forward, to track, respectively, input data already parsed but not forwarded yet
by the filter and parsed data considered as forwarded by the filter. A any time,
we have the warranty that a filter cannot parse or forward more input than
previous ones. And, of course, it cannot forward more input than it has
parsed. 2 macros has been added to retrieve these offets: FLT_NXT and FLT_FWD.
In addition, 2 functions has been added to change the 'next size' and the
'forward size' of a filter. When a filter parses input data, it can alter these
data, so the size of these data can vary. This action has an effet on all
previous filters that must be handled. To do so, the function
'filter_change_next_size' must be called, passing the size variation. In the
same spirit, if a filter alter forwarded data, it must call the function
'filter_change_forward_size'. 'filter_change_next_size' can be called in
'http_data' and 'tcp_data' callbacks and only these ones. And
'filter_change_forward_size' can be called in 'http_forward_data' and
'tcp_forward_data' callbacks and only these ones. The data changes are the
filter responsability, but with some limitation. It must not change already
parsed/forwarded data or data that previous filters have not parsed/forwarded
yet.
Because filters can be used on backends, when we the backend is set for a
stream, we add filters defined for this backend in the filter list of the
stream. But we must only do that when the backend and the frontend of the stream
are not the same. Else same filters are added a second time leading to undefined
behavior.
The HTTP compression code had to be moved.
So it simplifies http_response_forward_body function. To do so, the way the data
are forwarded has changed. Now, a filter (and only one) can forward data. In a
commit to come, this limitation will be removed to let all filters take part to
data forwarding. There are 2 new functions that filters should use to deal with
this feature:
* flt_set_http_data_forwarder: This function sets the filter (using its id)
that will forward data for the specified HTTP message. It is possible if it
was not already set by another filter _AND_ if no data was yet forwarded
(msg->msg_state <= HTTP_MSG_BODY). It returns -1 if an error occurs.
* flt_http_data_forwarder: This function returns the filter id that will
forward data for the specified HTTP message. If there is no forwarder set, it
returns -1.
When an HTTP data forwarder is set for the response, the HTTP compression is
disabled. Of course, this is not definitive.
The serial number for a generated certificate was computed using the requested
servername, without any variable/random part. It is not a problem from the
moment it is not regenerated.
But if the cache is disabled or when the certificate is evicted from the cache,
we may need to regenerate it. It is important to not reuse the same serial
number for the new certificate. Else clients (especially browsers) trigger a
warning because 2 certificates issued by the same CA have the same serial
number.
So now, the serial is a static variable initialized with now_ms (internal date
in milliseconds) and incremented at each new certificate generation.
(Ref MPS-2031)
in function 'si_connect', an existing connection is reused (and considered as
established) only when there are some pending data in the output channel.
This can be problem when filters are used, because a filter can choose to not
forward data immediatly. So when we try to initiate a connection to a server,
the output channel can be empty. In this situation, if the connection already
exists, it is not considered as established and nothing happens. If the stream
interface is in the state SI_ST_ASS, this leads to an infinite loop in
process_stream because it remains in this state.
This patch fixes this problem. Now, in 'si_connect', we always reuse an existing
connection, whether or not there are pending data in the output channel.
The function channel_recv_limit() relies on channel_reserved() which
itself relies on channel_in_transit(). Individually they're OK but
combined they're doing the wrong thing.
The problem is that we refrain from filling buffers while to_forward
is even much larger than the buffer because of a semantic issue along
the call chain. This is particularly visible when offloading SSL on
moderately large files (1 MB), though it is also visible on clear text.
Twice the number of recv() calls are made compared to what is needed,
and the typical performance drops by 15-20% in SSL in 1.6 and later,
and no directly measurable drop in 1.5 except when using strace.
There's no need for all these intermediate functions, so let's get
rid of them and reimplement channel_recv_limit() from scratch in a
safer way.
This fix needs to be backported to 1.6 and 1.5 (at least). Note that in
1.5 the function is called buffer_recv_limit() and it may differ a bit.
This function should return a 16-bit type as that is the type for
dns header id.
Also because it is doing an uint16 unpack big-endian operation.
Backport: can be backported to 1.6
Signed-off-by: Thiago Farina <tfarina@chromium.org>
Signed-off-by: Baptiste Assmann <bedis9@gmail.com>
The function http_reply_and_close has been added in proto_http.c to wrap calls
to stream_int_retnclose. This functions will be modified when the filters will
be added.
It is possible to create a http capture rule which points to a capture slot
id which does not exist.
Current patch prevent this when parsing configuration and prevent running
configuration which contains such rules.
This configuration is now invalid:
frontend f
bind :8080
http-request capture req.hdr(User-Agent) id 0
default_backend b
this one as well:
frontend f
bind :8080
declare capture request len 32 # implicit id is 0 here
http-request capture req.hdr(User-Agent) id 1
default_backend b
It applies of course to both http-request and http-response rules.
This is done by adding EVP_PKEY_EC type in supported types for the CA private
key when we get the message digest used to sign a generated X509 certificate.
So now, we support DSA, RSA and EC private keys.
And to be sure, when the type of the private key is not directly supported, we
get its default message digest using the function
'EVP_PKEY_get_default_digest_nid'.
We also use the key of the default certificate instead of generated it. So we
are sure to use the same key type instead of always using a RSA key.
First, the LRU cache must be initialized after the configuration parsing to
correctly set its size.
Next, the function 'ssl_sock_set_generated_cert' returns -1 when an error occurs
(0 if success). In that case, the caller is responsible to free the memory
allocated for the certificate.
Finally, when a SSL certificate is generated by HAProxy but cannot be inserted
in the cache, it must be freed when the SSL connection is closed. This happens
when 'tune.ssl.ssl-ctx-cache-size' is set to 0.
Michael Ezzell reported a bug causing haproxy to segfault during startup
when trying to send syslog message from Lua. The function __send_log() can
be called with *p that is NULL and/or when the configuration is not fully
parsed, as is the case with Lua.
This patch fixes this problem by using individual vectors instead of the
pre-generated strings log_htp and log_htp_rfc5424.
Also, this patch fixes a problem causing haproxy to write the wrong pid in
the logs -- the log_htp(_rfc5424) strings were generated at the haproxy
start, but "pid" value would be changed after haproxy is started in
daemon/systemd mode.
This patch adds a new RFC5424-specific log-format for the structured-data
that is automatically send by __send_log() when the sender is in RFC5424
mode.
A new statement "log-format-sd" should be used in order to set log-format
for the structured-data part in RFC5424 formatted syslog messages.
Example:
log-format-sd [exampleSDID@1234\ bytes=\"%B\"\ status=\"%ST\"]
The function __send_log() iterates over senders and passes the header as
the first vector to sendmsg(), thus it can send a logger-specific header
in each message.
A new logger arguments "format rfc5424" should be used in order to enable
RFC5424 header format. For example:
log 10.2.3.4:1234 len 2048 format rfc5424 local2 info
At the moment we have to call snprintf() for every log line just to
rebuild a constant. Thanks to sendmsg(), we send the message in 3 parts:
time-based header, proxy-specific hostname+log-tag+pid, session-specific
message.
This new target can be called from the frontend or the backend. It
is evaluated just before the backend choice and just before the server
choice. So, the input stream or HTTP request can be forwarded to a
server or to an internal service.
While the SI_ST_DIS state is set *after* doing the close on a connection,
it was set *before* calling release on an applet. Applets have no internal
flags contrary to connections, so they have no way to detect they were
already released. Because of this it happened that applets were closed
twice, once via si_applet_release() and once via si_release_endpoint() at
the end of a transaction. The CLI applet could perform a double free in
this case, though the situation to cause it is quite hard because it
requires that the applet is stuck on output in states that produce very
few data.
In order to solve this, we now assign the SI_ST_DIS state *after* calling
->release, and we refrain from doing so if the state is already assigned.
This makes applets work much more like connections and definitely avoids
this double release.
In the future it might be worth making applets have their own flags like
connections to carry their own state regardless of the stream interface's
state, especially when dealing with connection reuse.
No backport is needed since this issue was caused by the rearchitecture
in 1.6.
This function is a callback made only for calls from the applet handler.
Rename it to remove confusion. It's currently called from the Lua code
but that's not correct, we should call the notify and update functions
instead otherwise it will not enable the applet again.
This one is not needed anymore as what it used to do is either
completely covered by the new stream_int_notify() function, or undesired
and inherited from the past as a side effect of introducing the
connections.
This update is theorically never called since it's assigned only when
nothing is connected to the stream interface. However a test has been
added to si_update() to stay safe if some foreign code decides to call
si_update() in unsafe situations.
stream_int_notify() was taken from the common part between si_conn_wake_cb()
and si_applet_done(). It is designed to report activity to a stream from
outside its handler. It'll generally be used by lower layers to report I/O
completion but may also be used by remote streams if the buffer processing
is shared.
Now the call to stream_int_update() is moved to si_update(), which
is exclusively called from the stream, so that the socket layer may
be updated without updating the stream layer. This will later permit
to call it individually from other places (other tasks or applets for
example).
This function is designed to be called from within the stream handler to
update the channels' expiration timers and the stream interface's flags
based on the channels' flags. It needs to be called only once after the
channels' flags have settled down, and before they are cleared, though it
doesn't harm to call it as often as desired (it just slightly hurts
performance). It must not be called from outside of the stream handler,
as what it does will be used to compute the stream task's expiration.
The code was taken directly from stream_int_update_applet() and
stream_int_update_conn() which had exactly the same one except for
applet-specific or connection-specific status update.
This is not a real run queue and we're facing ugly bugs because
if this : if a an applet removes another applet from the queue,
typically the next one after itself, the list iterator loops
forever because the list's backup pointer is not valid anymore.
Before creating a run queue, let's rename this list.
Mailing list participant "mlist" reported negative conn_cur values in
stick tables as the result of "tcp-request connection track-sc". The
reason is that after the stick entry it copied from the session to the
stream, both the session and the stream grab a reference to the entry
and when the stream ends, it decrements one reference and one connection,
then the same is done for the session.
In fact this problem was already encountered slightly differently in the
past and addressed by Thierry using the patch below as it was believed by
then to be only a refcount issue since it was the observable symptom :
827752e "BUG/MEDIUM: stick-tables: refcount error after copying SC..."
In reality the problem is that the stream must touch neither the refcount
nor the connection count for entries it inherits from the session. While
we have no way to tell whether a track entry was inherited from the session
(since they're simply memcpy'd), it is possible to prevent the stream from
touching an entry that already exists in the session because that's a
guarantee that it was inherited from it.
Note that it may be a temporary fix. Maybe in the future when a session
gives birth to multiple streams we'll face a situation where a session may
be updated to add more tracked entries after instanciating some streams.
The correct long-term fix is to mark some tracked entries as shared or
private (or RO/RW). That will allow the session to track more entries
even after the same trackers are being used by early streams.
No backport is needed, this is only caused by the session/stream split in 1.6.
Added the definition of CHECK_HTTP_MESSAGE_FIRST and the declaration of
smp_prefetch_http to the header.
Changed smp_prefetch_http implementation to remove the static qualifier.
This directive gives HAProxy the ability to use the either the global
server-state-file directive or a local one using server-state-file-name to
load server states.
The state can be saved right before the reload by the init script, using
the "show servers state" command on the stats socket redirecting output into
a file.
This patch is inspired by Bowen Ni's proposal and it is based on his first
implementation:
With Lua integration in HAProxy 1.6, one can change the request method,
path, uri, header, response header etc except response line.
I'd like to contribute the following methods to allow modification of the
response line.
[...]
There are two new keywords in 'http-response' that allows you to rewrite
them in the native HAProxy config. There are also two new APIs in Lua that
allows you to do the same rewriting in your Lua script.
Example:
Use it in HAProxy config:
*http-response set-code 404*
Or use it in Lua script:
*txn.http:res_set_reason("Redirect")*
I dont take the full patch because the manipulation of the "reason" is useless.
standard reason are associated with each returned code, and unknown code can
take generic reason.
So, this patch can set the status code, and the reason is automatically adapted.
This was the first transparent proxy technology supported by haproxy
circa 2005 but it was obsoleted in 2007 by Tproxy 4.0 which removed a
lot of the earlier versions' shortcomings and was finally merged into
the kernel. Since nobody has been using cttproxy for many years now
and nobody has even just tried to compile the files, it's time to
remove it. The doc was updated as well.
This patch removes the special stick tables types names and
use the standard sample type names. This avoid the maintainance
of two types and remove the switch/case for matching a sample
type for each stick table type.
This patch is the first step for sample integration. Actually
the stick tables uses her own data type, and some converters
must be called to convert sample type to stick-tables types.
This patch removes the stick-table types and replace it by
the sample types. This prevent:
- Maintenance of two types of converters
- reduce the code using the samples converters
Each (http|tcp)-(request|response) action use the same method
for looking up the action keyword during the cofiguration parsing.
This patch mutualize the code.
This patch merges the conguration keyword struct. Each declared configuration
keyword struct are similar with the others. This patch simplify the code.
For performances considerations, some actions are not processed by remote
function. They are directly processed by the function. Some of these actions
does the same things but for different processing part (request / response).
This patch give the same name for the same actions, and change the normalization
of the other actions names.
This patch is ONLY a rename, it doesn't modify the code.
This patch is the first of a serie which merge all the action structs. The
function "tcp-request content", "tcp-response-content", "http-request" and
"http-response" have the same values and the same process for some defined
actions, but the struct and the prototype of the declared function are
different.
This patch try to unify all of these entries.
This patch remove the struct information stored both in the struct
sample_data and in the striuct sample. Now, only thestruct sample_data
contains data, and the struct sample use the struct sample_data for storing
his own data.
appsessions started to be deprecated with the introduction of stick
tables, and the latter are much more powerful and flexible, and in
addition they are replicated between nodes and maintained across
reloads. Let's now remove appsession completely.
This function is now dedicated to idle connections only, which means
that it must not be used without any endpoint nor anything not a
connection. The connection remains attached to the stream interface.
This list member will be used to attach a connection to a list of
idle, reusable or queued connections. It's unused for now. Given
that it's not expected to be used more than a few times per session,
the member was put after the target, in the area starting at the
second cache line of the structure.
This function only detaches the endpoint from the stream-int and
optionally returns the original pointer. This will be needed to
steal idle connections from other connections.
Since we now always call this function with the reuse parameter cleared,
let's simplify the function's logic as it cannot return the existing
connection anymore. The savings on this inline function are appreciable
(240 bytes) :
$ size haproxy.old haproxy.new
text data bss dec hex filename
1020383 40816 36928 1098127 10c18f haproxy.old
1020143 40816 36928 1097887 10c09f haproxy.new
This patch introduces three new functions which can be used to find a
server in a farm using different server information:
- server unique id (srv->puid)
- server name
- find best match using either name or unique id
When performing best matching, the following applies:
- use the server name first (if provided)
- use the server id if provided
in any case, the function can update the caller about mismatches
encountered.
Change si_alloc_conn() to call si_release_endpoint() instead of
open-coding the connection releasing code when reuse is disabled.
This fuses the code with the one already dealing with applets, makes
it shorter and helps centralizing the connection freeing logic at a
single place.
ssl_sock_set_servername() is used to set the SNI hostname on an
outgoing connection. This function comes from code originally
provided by Christopher Faulet of Qualys.
This modification makes possible to use sample_fetch_string() in more places,
where we might need to fetch sample values which are not plain strings. This
way we don't need to fetch string, and convert it into another type afterwards.
When using aliased types, the caller should explicitly check which exact type
was returned (e.g. SMP_T_IPV4 or SMP_T_IPV6 for SMP_T_ADDR).
All usages of sample_fetch_string() are converted to use new function.
When Lua is disabled, the alternate functions must have the same
prototype as the original ones, otherwise we get such warnings :
src/stream.c:278:27: warning: too many arguments in call to 'hlua_ctx_destroy'
hlua_ctx_destroy(&s->hlua);
~~~~~~~~~~~~~~~~ ^
No backport is needed.
This patch adds two functions used for variable acces using the
variable full name. If the variable doesn't exists in the variable
pool name, it is created.
This patch adds support of variables during the processing of each stream. The
variables scope can be set as 'session', 'transaction', 'request' or 'response'.
The variable type is the type returned by the assignment expression. The type
can change while the processing.
The allocated memory can be controlled for each scope and each request, and for
the global process.
Implementation of a DNS client in HAProxy to perform name resolution to
IP addresses.
It relies on the freshly created UDP client to perform the DNS
resolution. For now, all UDP socket calls are performed in the
DNS layer, but this might change later when the protocols are
extended to be more suited to datagram mode.
A new section called 'resolvers' is introduced thanks to this patch. It
is used to describe DNS servers IP address and also many parameters.
Basic introduction of a UDP layer in HAProxy. It can be used as a
client only and manages UDP exchanges with servers.
It can't be used to load-balance UDP protocols, but only used by
internal features such as DNS resolution.
Ability to change a server IP address during HAProxy run time.
For now this is provided via function update_server_addr() which
currently is not called.
A log is emitted on each change. For now we do it inconditionally,
but later we'll want to do it only on certain circumstances, which
explains why the logging block is enclosed in if(1).
Following functions are now available in the SSL public API:
* ssl_sock_create_cert
* ssl_sock_get_generated_cert
* ssl_sock_set_generated_cert
* ssl_sock_generated_cert_serial
These functions could be used to create a certificate by hand, set it in the
cache used to store generated certificates and retrieve it. Here is an example
(pseudo code):
X509 *cacert = ...;
EVP_PKEY *capkey = ...;
char *servername = ...;
unsigned int serial;
serial = ssl_sock_generated_cert_serial(servername, strlen(servername));
if (!ssl_sock_get_generated_cert(serial, cacert)) {
SSL_CTX *ctx = ssl_sock_create_cert(servername, serial, cacert, capkey);
ssl_sock_set_generated_cert(ctx, serial, cacert);
}
With this patch, it is possible to configure HAProxy to forge the SSL
certificate sent to a client using the SNI servername. We do it in the SNI
callback.
To enable this feature, you must pass following BIND options:
* ca-sign-file <FILE> : This is the PEM file containing the CA certitifacte and
the CA private key to create and sign server's certificates.
* (optionally) ca-sign-pass <PASS>: This is the CA private key passphrase, if
any.
* generate-certificates: Enable the dynamic generation of certificates for a
listener.
Because generating certificates is expensive, there is a LRU cache to store
them. Its size can be customized by setting the global parameter
'tune.ssl.ssl-ctx-cache-size'.
This patch adds the ssl-dh-param-file global setting. It sets the
default DH parameters that will be used during the SSL/TLS handshake when
ephemeral Diffie-Hellman (DHE) key exchange is used, for all "bind" lines
which do not explicitely define theirs.
This patch does'nt add any new feature: the functional behavior
is the same than version 1.0.
Technical differences:
In this version all updates on different stick tables are
multiplexed on the same tcp session. There is only one established
tcp session per peer whereas in first version there was one established
tcp session per peer and per stick table.
Messages format was reviewed to be more evolutive and to support
further types of data exchange such as SSL sessions or other sticktable's
data types (currently only the sticktable's server id is supported).
In order to support http-response redirect, the parsing needs to be
adapted a little bit to only support the "location" type, and to
adjust the log-format parser so that it knows the direction of the
sample fetch calls.
This function tries to spot a proxy by its name, ID and type, and
in case some elements don't match, it tries to determine which ones
could be ignored and reports which ones were ignored so that the
caller can decide whether or not it wants to pick this proxy. This
will be used for maintaining the status across reloads where the
config might have changed a bit.
It does the same as the other one except that it only focuses on the
numeric ID and the capabilities. It's used by proxy_find_by_name()
for numeric names.
First, findproxy() was renamed proxy_find_by_name() so that its explicit
that a name is required for the lookup. Second, we give this function
the ability to search for tables if needed. Third we now provide inline
wrappers to pass the appropriate PR_CAP_* flags and to explicitly look
up a frontend, backend or table.
Until now, HAproxy needed to be restarted to change the TLS ticket
keys. With this patch, the TLS keys can be updated on a per-file
basis using the admin socket. Two new socket commands have been
introduced: "show tls-keys" and "set ssl tls-keys".
Signed-off-by: Nenad Merdanovic <nmerdan@anine.io>
This patch removes the structs "session", "stream" and "proxy" from
the sample-fetches and converters function prototypes.
This permits to remove some weight in the prototype call.
Currently we have a problem. There are some cases where a sleeping applet
is not woken up (eg: show sess during an injection). The reason is that
the applet is marked WAIT_DATA and is not woken up when WAIT_ROOM leaves,
because we wait for both flags to be cleared in order to call it.
And if we wait for either flag, then we have the opposite situation, which
is that we're not waiting for room in the output buffer so we're spinning
calling the applet to do nothing.
What is missing is an indication of what the applet needs. Since it only
manipulates the WAIT_ROOM/WAIT_DATA which are overwritten later, that cannot
work. In the case of connections, the problem doesn't happen because the
connection maintains these extra states. Ideally we'd need to have similar
states for each appctx and to store those information there. But it would
be overcomplicated given that an applet doesn't exist alone without a
stream-int, so we can safely put these information into the stream int and
make the code simpler.
With this patch we introduce two new flags in the stream interface :
- SI_FL_WANT_PUT : the applet wants to put something into the buffer
- SI_FL_WANT_GET : the applet wants to get something from the buffer
We also have the new functions si_applet_{stop|want|cant}_{get|put}
to make the code look similar to the connection code.
For now these flags are not used yet.
This is the equivalent of si_conn_wake() but for applets. It will be
called after changes to the stream interface are brought by the applet
I/O handler. Ultimately it will release buffers and may be even wake
the stream's task up if some important changes are detected.
It would be nice to be able to merge it with the connection's wake
function since it mostly manipulates the stream interface, but there
are minor differences (such as how to enable/disable polling on a fd
vs applet) and some specificities to applets (eg: don't wake the
applet up until the output is empty) which would require abstract
functions which would slow down everything.
The new function is called for each round of polling in order to call any
active appctx. For now we pick the stream interface from the appctx's
owner. At the moment there's no appctx queued yet, but we have everything
needed to queue them and remove them.
Now that applet's functions only take an appctx in argument, not a
stream interface. This slightly simplifies the code and will be needed
to take the appctx out of the stream interface.
It happened after changing the stream interface deinitialization
sequence that we got random crashes with si_shutw() being called
on NULL si->end. The reason was that si->ops was not reset after
a call to si_release_endpoint() which is sometimes called directly.
Thus we now move the resetting of si->ops just after any si->end
assignment. It happens that si_detach() is now just the same as
si_release_endpoint() and stream_int_unregister_handler(). Some
cleanup will have to be performed there.
It's not sure whether this problem can impact 1.5 since in 1.5
applets are part of the default embedded stream handler. The only
way it could cause some trouble is if it's used with a connection,
which doesn't seem possible at first glance.
We don't pass sess->origin anymore but the pointer to the previous step. Now
it should be much easier to chain elements together once applets are moved out
of streams. Indeed, the session is only used for configuration and not for the
dynamic chaining anymore.
This patch cretes a new Map class that permits to do some lookup in
HAProxy maps. This Map class is integration in the HAProxy update
system, so we can modify the map throught the socket.
The function was called stream_accept_session(), let's rename it
stream_new() and make it return the newly allocated pointer. It's
more convenient for some callers who need it.
This concerns everythins related to accepting a new session and
expiring the embryonic session. There's still a hard-coded call
to stream_accept_session() which could be set somewhere in the
frontend, but for now it's not a problem.
It passes a NULL wherever a stream was needed (acl_exec_cond() and
action_ptr mainly). It can still track the connection rate correctly
and block based on ACLs.
In order to support sessions tracking counters, we first ensure that there
is no overlap between streams' stkctr and sessions', and we allow an
automatic lookup into the session's counters when the stream doesn't
have a counter or when the stream doesn't exist during an access via
a sample fetch. The functions used to update the stream counters only
update them and not the session counters however.
Many such function need a session, and till now they used to dereference
the stream. Once we remove the stream from the embryonic session, this
will not be possible anymore.
So as of now, sample fetch functions will be called with this :
- sess = NULL, strm = NULL : never
- sess = valid, strm = NULL : tcp-req connection
- sess = valid, strm = valid, strm->txn = NULL : tcp-req content
- sess = valid, strm = valid, strm->txn = valid : http-req / http-res
The registerable http_req_rules / http_res_rules used to require a
struct http_txn at the end. It's redundant with struct stream and
propagates very deep into some parts (ie: it was the reason for lua
requiring l7). Let's remove it now.
All of them can now retrieve the HTTP transaction *if it exists* from
the stream and be sure to get NULL there when called with an embryonic
session.
The patch is a bit large because many locations were touched (all fetch
functions had to have their prototype adjusted). The opportunity was
taken to also uniformize the call names (the stream is now always "strm"
instead of "l4") and to fix indent where it was broken. This way when
we later introduce the session here there will be less confusion.
Now this one is dynamically allocated. It means that 280 bytes of memory
are saved per TCP stream, but more importantly that it will become
possible to remove the l7 pointer from fetches and converters since
it will be deduced from the stream and will support being null.
A lot of care was taken because it's easy to forget a test somewhere,
and the previous code used to always trust s->txn for being valid, but
all places seem to have been visited.
All HTTP fetch functions check the txn first so we shouldn't have any
issue there even when called from TCP. When branching from a TCP frontend
to an HTTP backend, the txn is properly allocated at the same time as the
hdr_idx.
There is now a pointer to the session in the stream, which is NULL
for now. The session pool is created as well. Some parts will move
from the stream to the session now.
With HTTP/2, we'll have to support multiplexed streams. A stream is in
fact the largest part of what we currently call a session, it has buffers,
logs, etc.
In order to catch any error, this commit removes any reference to the
struct session and tries to rename most "session" occurrences in function
names to "stream" and "sess" to "strm" when that's related to a session.
The files stream.{c,h} were added and session.{c,h} removed.
The session will be reintroduced later and a few parts of the stream
will progressively be moved overthere. It will more or less contain
only what we need in an embryonic session.
Sample fetch functions and converters will have to change a bit so
that they'll use an L5 (session) instead of what's currently called
"L4" which is in fact L6 for now.
Once all changes are completed, we should see approximately this :
L7 - http_txn
L6 - stream
L5 - session
L4 - connection | applet
There will be at most one http_txn per stream, and a same session will
possibly be referenced by multiple streams. A connection will point to
a session and to a stream. The session will hold all the information
we need to keep even when we don't yet have a stream.
Some more cleanup is needed because some code was already far from
being clean. The server queue management still refers to sessions at
many places while comments talk about connections. This will have to
be cleaned up once we have a server-side connection pool manager.
Stream flags "SN_*" still need to be renamed, it doesn't seem like
any of them will need to move to the session.
There's no reason for exporting identity_* nor deflate_*, they're only
used in the same file. Mark them static, it will make it easier to add
other algorithms.
Till now we used to rely on a fixed maximum chunk size. Thanks to last
commit we're now free to adjust the chunk's length before sending the
data, so we don't have to use 6 digits all the time anymore, and if
one wants buffers larger than 16 MB it is now possible.
This function is a callback for HTTP actions. This function
creates the replacement string from a build_logline() format
and transform the header.
This patch split this function in two part. With this modification,
the header transformation and the replacement string are separed.
We can now transform the header with another replacement string
source than a build_logline() format.
The first part is the replacement engine. It take a replacement action
number and a replacement string and process the action.
The second part is the function which is called by the 'http-request
action' to replace a request line part. This function makes the
string used as replacement.
This split permits to use the replacement engine in other parts of the
code than the request action. The Lua use it for his own http action.
It's now called conn_sock_drain() to make it clear that it only reads
at the sock layer and not at the data layer. The function was too big
to remain inlined and it's used at a few places where size counts.
Currently si_idle_conn_null_cb() has to perform some low-level checks
over the file descriptor and the connection configuration that should
only belong to conn_drain(). Let's move these controls there. The
function now automatically checks for errors and hangups on the file
descriptor for example, and disables recv polling if there's no drain
function at the control layer.
This function is an equivalent to send() which operates over a connection
instead of a file descriptor. It checks that the control layer is ready
and that it's allowed to send. If automatically enables polling if it
cannot send. It simplifies the return checks by returning zero in all
cases where it cannot send so that the caller only has to care about
negative values indicating errors.
This will save callers from having to care about conn->xprt and xprt->shutw.
Note that shutw() takes a second argument indicating whether it's a clean or
a hard shutw. This is used by SSL which tries to close cleanly in most cases.
Here we provide two versions, conn_data_shutw() which performs the clean
close, and conn_data_shutw_hard() which does the unclean one.
This function was not used yet and was only supposed to mark the connection
as shutdown for write. Unfortunately at other places in stream_interface.c,
we're seeing a bit of layering violations with attempts to perform the shutdown
on the fd directly. Let's make this function call shutdown() itself so that
the callers only have to care about the connection.
Now that we can get the session from the channel, let's simplify the
prototype of session_alloc_recv_buffer() to only require the channel.
Both the caller and the function are now simplified.
The purpose of these two macros will be to pass via the session to
find the relevant stream interfaces so that we don't need to store
the ->cons nor ->prod pointers anymore. Currently they're only defined
so that all references could be removed.
Note that many places need a second pass of clean up so that we don't
have any chn_prod(&s->req) anymore and only &s->si[0] instead, and
conversely for the 3 other cases.
At a few places we need to find one stream interface from the other one.
Instead of passing via the channel, we simply use the session as an
intermediary, which simply results in applying an offset to the pointer.
We go back to the session to get the owner. Here again it's very easy
and is just a matter of relative offsets. Since the owner always exists
and always points to the session's task, we can remove some unneeded
tests.
In order to plan removal of si->ib / si->ob, we now check the side of the
stream interface and find the session, then the requested channel. In
practice it's just an offset applied to the pointer based on the flag.
This new flag "SI_FL_ISBACK" is set only on the back SI and is cleared
on the front SI. That way it's possible only by looking at the SI to
know what side it is.
We'll soon remove direct references to the channels from the stream
interface since everything belongs to the same session, so let's
first not dereference si->ib / si->ob anymore and use macros instead.
If we are writing in the request buffer, we are not waked up
when the data are forwarded because it is useles. The request
analyzers are waked up only when data is incoming. So, if the
request buffer is full, we set the WAKE_ON_WRITE flag.
Before this patch, each yield in a Lua action set a flags to be
waked up when some activity were detected on the response channel.
This behavior causes loop in the analyzer process.
This patch set the wake up on response buffer activity only if we
really want to be waked up on this activity.
This flag indicate that the current yield is returned by the Lua
execution task control. If this flag is set, the current task may
quit but will be set in the run queue to be re-executed immediatly.
This patch modify the "hlua_yieldk()" function, it adds an argument
that contain a field containing yield options.
In the future, the lua execution must return scheduling informations.
We want more than one flag, so I convert an integer used with an
enum into an interer used as bitfield.
This system permits to execute some lua function after than HAProxy
complete his initialisation. These functions are executed between
the end of the configuration parsing and check and the begin of the
scheduler.
This is the first step of the lua integration. We add the useful
files in the HAProxy project. These files contains the main
includes, the Makefile options and empty initialisation function.
Is is the LUA skeleton.
We now have functions to retrieve one block and one line from
either the input or the output part of a buffer. They return
up to two (pointer,length) values in case the buffer wraps.
This patch introduces an action keyword registration system for TCP
rulesets similar to what is available for HTTP rulesets. This sytem
will be useful with lua.
These modifications are done for resolving cross-dependent
includes in the upcoming LUA code.
<proto/channel.h> misses <types/channel.h>.
<types/acl.h> doesn't use <types/session.h> because the session
is already declared in the file as undefined pointer.
appsession.c misses <unistd.h> to use "write()".
Declare undefined pointer "struct session" for <types/proxy.h>
and <types/queue.h>. These includes dont need the detail of this
struct.
The functions "val_payload_lv" and "val_hdr" are useful with
lua. The lua automatic binding for sample fetchs needs to
compare check functions.
The "arg_type_names" permit to display error messages.
Actually, HAProxy uses the function "process_runnable_tasks" and
"wake_expired_tasks" to get the next task which can expires.
If a task is added with "task_schedule" or other method during
the execution of an other task, the expiration of this new task
is not taken into account, and the execution of this task can be
too late.
Actualy, HAProxy seems to be no sensitive to this bug.
This fix moves the call to process_runnable_tasks() before the timeout
calculation and ensures that all wakeups are processed together. Only
wake_expired_tasks() needs to return a timeout now.
As found by Thierry Fournier, if a task manages to kill another one and
if this other task is the next one in the run queue, we can do whatever
including crashing, because the scheduler restarts from the saved next
task. For now, there is no such concept of a task killing another one,
but with Lua it will come.
A solution consists in always performing the lookup of the first task in
the scheduler's loop, but it's expensive and costs around 2% of the
performance.
Another solution consists in keeping a global next run queue node and
ensuring that when this task gets removed, it updates this pointer to
the next one. This allows to simplify the code a bit and in the end to
slightly increase the performance (0.3-0.5%). The mechanism might still
be usable if we later migrate to a multi-threaded scheduler.
The patch "MEDIUM: args: increase arg type to 5 bits and limit arg count
to 5" (dbc79d0a) increased the number of types supported, but forgot to
remove the ARG6/ARG7 macros.
This patch adds a new option which allows configuration of the maximum
log level of messages for which email alerts will be sent.
The default is alert which is more restrictive than
the current code which sends email alerts for all priorities.
That behaviour may be configured using the new configuration
option to set the maximum level to notice or greater.
email-alert level notice
Signed-off-by: Simon Horman <horms@verge.net.au>
Refactor init_check so that an error string is returned
rather than alerts being printed by it. Also
init_check to checks.c and provide a prototype to allow
it to be used from multiple C files.
Signed-off-by: Simon Horman <horms@verge.net.au>
This is in order to add new types. This patch does not change anything
else. Two remaining (harmless) occurrences of a count of 8 instead of 7
were fixed by this patch : empty_arg_list[] and the for() loop counting
args.
It applies to the channel and it doesn't erase outgoing data, only
pending unread data, which is strictly equivalent to what recv()
does with MSG_TRUNC, so that new name is more accurate and intuitive.
This name more accurately reminds that it applies to a channel and not
to a buffer, and that what is returned may be used as a max number of
bytes to pass to recv().
This applies to the channel, not the buffer, so let's fix this name.
Warning, the function's name happens to be the same as the old one
which was mistakenly used during 1.5.
This function's name was poorly chosen and is confusing to the point of
being suspiciously used at some places. The operations it does always
consider the ability to forward pending input data before receiving new
data. This is not obvious at all, especially at some places where it was
used when consuming outgoing data to know if the buffer has any chance
to ever get the missing data. The code needs to be re-audited with that
in mind. Care must be taken with existing code since the polarity of the
function was switched with the renaming.
channel_reserved is confusingly named. It is used to know whether or
not the rewrite area is left intact for situations where we want to
ensure we can use it before proceeding. Let's rename it to fix this
confusion.
Option http-send-name-header is still hurting. If a POST request has to be
redispatched when this option is used, and the next server's name is larger
than the initial one, and the POST body fills the buffer, it becomes
impossible to rewrite the server's name in the buffer when redispatching.
In 1.4, this is worse, the process may crash because of a negative size
computation for the memmove().
The only solution to fix this is to refrain from eating the reserve before
we're certain that we won't modify the buffer anymore. And the condition for
that is that the connection is established.
This patch introduces "channel_may_send()" which helps to detect whether it's
safe to eat the reserve or not. This condition is used by channel_in_transit()
introduced by recent patches.
This patch series must be backported into 1.5, and a simpler version must be
backported into 1.4 where fixing the bug is much easier since there were no
channels by then. Note that in 1.4 the severity is major.
This function returns the amount of bytes in transit in a channel's buffer,
which is the amount of outgoing data plus the amount of incoming data bound
to the forward limit.
We know that all incoming data are going to be purged if to_forward
is greater than them, not only if greater than the buffer size. This
buf has no direct impact on this version, but it participates to some
bugs affecting http-send-name-header since 1.4. This fix will have to
be backported down to 1.4 albeit in a different form.
The buffer_max_len() function is subject to an integer overflow in this
calculus :
int ret = global.tune.maxrewrite - chn->to_forward - chn->buf->o;
- chn->to_forward may be up to 2^31 - 1
- chn->buf->o may be up to chn->buf->size
- global.tune.maxrewrite is by definition smaller than chn->buf->size
Thus here we can subtract (2^31 + buf->o) (highly negative) from something
slightly positive, and result in ret being larger than expected.
Fortunately in 1.5 and 1.6, this is only used by bi_avail() which itself
is used by applets which do not set high values for to_forward so this
problem does not happen there. However in 1.4 the equivalent computation
was used to limit the size of a read and can result in a read overflow
when combined with the nasty http-send-name-header feature.
This fix must be backported to 1.5 and 1.4.
Since commit 3dd6a25 ("MINOR: stream-int: retrieve session pointer from
stream-int"), we can get the session from the task, so let's get rid of
this less obvious function.
bi_swpbuf() swaps the buffer passed in argument with the one attached to
the channel, but only if this last one is empty. The idea is to avoid a
copy when buffers can simply be swapped.
We've already experimented with three wake up algorithms when releasing
buffers : the first naive one used to wake up far too many sessions,
causing many of them not to get any buffer. The second approach which
was still in use prior to this patch consisted in waking up either 1
or 2 sessions depending on the number of FDs we had released. And this
was still inaccurate. The third one tried to cover the accuracy issues
of the second and took into consideration the number of FDs the sessions
would be willing to use, but most of the time we ended up waking up too
many of them for nothing, or deadlocking by lack of buffers.
This patch completely removes the need to allocate two buffers at once.
Instead it splits allocations into critical and non-critical ones and
implements a reserve in the pool for this. The deadlock situation happens
when all buffers are be allocated for requests pending in a maxconn-limited
server queue, because then there's no more way to allocate buffers for
responses, and these responses are critical to release the servers's
connection in order to release the pending requests. In fact maxconn on
a server creates a dependence between sessions and particularly between
oldest session's responses and latest session's requests. Thus, it is
mandatory to get a free buffer for a response in order to release a
server connection which will permit to release a request buffer.
Since we definitely have non-symmetrical buffers, we need to implement
this logic in the buffer allocation mechanism. What this commit does is
implement a reserve of buffers which can only be allocated for responses
and that will never be allocated for requests. This is made possible by
the requester indicating how much margin it wants to leave after the
allocation succeeds. Thus it is a cooperative allocation mechanism : the
requester (process_session() in general) prefers not to get a buffer in
order to respect other's need for response buffers. The session management
code always knows if a buffer will be used for requests or responses, so
that is not difficult :
- either there's an applet on the initiator side and we really need
the request buffer (since currently the applet is called in the
context of the session)
- or we have a connection and we really need the response buffer (in
order to support building and sending an error message back)
This reserve ensures that we don't take all allocatable buffers for
requests waiting in a queue. The downside is that all the extra buffers
are really allocated to ensure they can be allocated. But with small
values it is not an issue.
With this change, we don't observe any more deadlocks even when running
with maxconn 1 on a server under severely constrained memory conditions.
The code becomes a bit tricky, it relies on the scheduler's run queue to
estimate how many sessions are already expected to run so that it doesn't
wake up everyone with too few resources. A better solution would probably
consist in having two queues, one for urgent requests and one for normal
requests. A failed allocation for a session dealing with an error, a
connection event, or the need for a response (or request when there's an
applet on the left) would go to the urgent request queue, while other
requests would go to the other queue. Urgent requests would be served
from 1 entry in the pool, while the regular ones would be served only
according to the reserve. Despite not yet having this, it works
remarkably well.
This mechanism is quite efficient, we don't perform too many wake up calls
anymore. For 1 million sessions elapsed during massive memory contention,
we observe about 4.5M calls to process_session() compared to 4.0M without
memory constraints. Previously we used to observe up to 16M calls, which
rougly means 12M failures.
During a test run under high memory constraints (limit enforced to 27 MB
instead of the 58 MB normally needed), performance used to drop by 53% prior
to this patch. Now with this patch instead it *increases* by about 1.5%.
The best effect of this change is that by limiting the memory usage to about
2/3 to 3/4 of what is needed by default, it's possible to increase performance
by up to about 18% mainly due to the fact that pools are reused more often
and remain hot in the CPU cache (observed on regular HTTP traffic with 20k
objects, buffers.limit = maxconn/10, buffers.reserve = limit/2).
Below is an example of scenario which used to cause a deadlock previously :
- connection is received
- two buffers are allocated in process_session() then released
- one is allocated when receiving an HTTP request
- the second buffer is allocated then released in process_session()
for request parsing then connection establishment.
- poll() says we can send, so the request buffer is sent and released
- process session gets notified that the connection is now established
and allocates two buffers then releases them
- all other sessions do the same till one cannot get the request buffer
without hitting the margin
- and now the server responds. stream_interface allocates the response
buffer and manages to get it since it's higher priority being for a
response.
- but process_session() cannot allocate the request buffer anymore
=> We could end up with all buffers used by responses so that none may
be allocated for a request in process_session().
When the applet processing leaves the session context, the test will have
to be changed so that we always allocate a response buffer regardless of
the left side (eg: H2->H1 gateway). A final improvement would consists in
being able to only retry the failed I/O operation without waking up a
task, but to date all experiments to achieve this have proven not to be
reliable enough.
When a session_alloc_buffers() fails to allocate one or two buffers,
it subscribes the session to buffer_wq, and waits for another session
to release buffers. It's then removed from the queue and woken up with
TASK_WAKE_RES, and can attempt its allocation again.
We decide to try to wake as many waiters as we release buffers so
that if we release 2 and two waiters need only once, they both have
their chance. We must never come to the situation where we don't wake
enough tasks up.
It's common to release buffers after the completion of an I/O callback,
which can happen even if the I/O could not be performed due to half a
failure on memory allocation. In this situation, we don't want to move
out of the wait queue the session that was just added, otherwise it
will never get any buffer. Thus, we only force ourselves out of the
queue when freeing the session.
Note: at the moment, since session_alloc_buffers() is not used, no task
is subscribed to the wait queue.
This patch introduces session_alloc_recv_buffer(), session_alloc_buffers()
and session_release_buffers() whose purpose will be to allocate missing
buffers and release unneeded ones around the process_session() and during
I/O operations.
I/O callbacks only need a single buffer for recv operations and none
for send. However we still want to ensure that we don't pick the last
buffer. That's what session_alloc_recv_buffer() is for.
This allocator is atomic in that it always ensures we can get 2 buffers
or fails. Here, if any of the buffers is not ready and cannot be
allocated, the operation is cancelled. The purpose is to guarantee that
we don't enter into the deadlock where all buffers are allocated by the
same size of all sessions.
A queue will have to be implemented for failed allocations. For now
they're just reported as failures.
Till now we'd consider a buffer full even if it had size==0 due to pointing
to buf.size. Now we change this : if buf_wanted is present, it means that we
have already tried to allocate a buffer but failed. Thus the buffer must be
considered full so that we stop trying to poll for reads on it. Otherwise if
it's empty, it's buf_empty and we report !full since we may allocate it on
the fly.
Channels are now created with a valid pointer to a buffer before the
buffer is allocated. This buffer is a global one called "buf_empty" and
of size zero. Thus it prevents any activity from being performed on
the buffer and still ensures that chn->buf may always be dereferenced.
b_free() also resets the buffer to &buf_empty, and was split into
b_drop() which does not reset the buffer.
We'll soon need to be able to switch buffers without touching the
channel, so let's move buffer initialization out of channel_init().
We had the same in compressoin.c.
Since commit 656c5fa7e8 ("BUILD: ssl: disable OCSP when using
boringssl) the OCSP code is bypassed when OPENSSL_IS_BORINGSSL
is defined. The correct thing to do here is to use OPENSSL_NO_OCSP
instead, which is defined for this exact purpose in
openssl/opensslfeatures.h.
This makes haproxy forward compatible if boringssl ever introduces
full OCSP support with the additional benefit that it links fine
against a OCSP-disabled openssl.
Signed-off-by: Lukas Tribus <luky-37@hotmail.com>
A memory optimization can use the same pattern expression for many
equal pattern list (same parse method, index method and index_smp
method).
The pattern expression is returned by "pattern_new_expr", but this
function dont indicate if the returned pattern is already in use.
So, the caller function reload the list of patterns in addition with
the existing patterns. This behavior is not a problem with tree indexed
pattern, but it grows the lists indexed patterns.
This fix add a "reuse" flag in return of the function "pattern_new_expr".
If the flag is set, I suppose that the patterns are already loaded.
This fix must be backported into 1.5.
In order for HTTP/2 not to eat too much memory, we'll have to support
on-the-fly buffer allocation, since most streams will have an empty
request buffer at some point. Supporting allocation on the fly means
being able to sleep inside I/O callbacks if a buffer is not available.
Till now, the I/O callbacks were called from two locations :
- when processing the cached events
- when processing the polled events from the poller
This change cleans up the design a bit further than what was started in
1.5. It now ensures that we never call any iocb from the poller itself
and that instead, events learned by the poller are put into the cache.
The benefit is important in terms of stability : we don't have to care
anymore about the risk that new events are added into the poller while
processing its events, and we're certain that updates are processed at
a single location.
To achieve this, we now modify all the fd_* functions so that instead of
creating updates, they add/remove the fd to/from the cache depending on
its state, and only create an update when the polling status reaches a
state where it will have to change. Since the pollers make use of these
functions to notify readiness (using fd_may_recv/fd_may_send), the cache
is always up to date with the poller.
Creating updates only when the polling status needs to change saves a
significant amount of work for the pollers : a benchmark showed that on
a typical TCP proxy test, the amount of updates per connection dropped
from 11 to 1 on average. This also means that the update list is smaller
and has more chances of not thrashing too many CPU cache lines. The first
observed benefit is a net 2% performance gain on the connection rate.
A second benefit is that when a connection is accepted, it's only when
we're processing the cache, and the recv event is automatically added
into the cache *after* the current one, resulting in this event to be
processed immediately during the same loop. Previously we used to have
a second run over the updates to detect if new events were added to
catch them before waking up tasks.
The next gain will be offered by the next steps on this subject consisting
in implementing an I/O queue containing all cached events ordered by priority
just like the run queue, and to be able to leave some events pending there
as long as needed. That will allow us *not* to perform some FD processing
if it's not the proper time for this (typically keep waiting for a buffer
to be allocated if none is available for an recv()). And by only processing
a small bunch of them, we'll allow priorities to take place even at the I/O
level.
As a result of this change, functions fd_alloc_or_release_cache_entry()
and fd_process_polled_events() have disappeared, and the code dedicated
to checking for new fd events after the callback during the poll() loop
was removed as well. Despite the patch looking large, it's mostly a
change of what function is falled upon fd_*() and almost nothing was
added.
This patch makes it possible to create binds and servers in separate
namespaces. This can be used to proxy between multiple completely independent
virtual networks (with possibly overlapping IP addresses) and a
non-namespace-aware proxy implementation that supports the proxy protocol (v2).
The setup is something like this:
net1 on VLAN 1 (namespace 1) -\
net2 on VLAN 2 (namespace 2) -- haproxy ==== proxy (namespace 0)
net3 on VLAN 3 (namespace 3) -/
The proxy is configured to make server connections through haproxy and sending
the expected source/target addresses to haproxy using the proxy protocol.
The network namespace setup on the haproxy node is something like this:
= 8< =
$ cat setup.sh
ip netns add 1
ip link add link eth1 type vlan id 1
ip link set eth1.1 netns 1
ip netns exec 1 ip addr add 192.168.91.2/24 dev eth1.1
ip netns exec 1 ip link set eth1.$id up
...
= 8< =
= 8< =
$ cat haproxy.cfg
frontend clients
bind 127.0.0.1:50022 namespace 1 transparent
default_backend scb
backend server
mode tcp
server server1 192.168.122.4:2222 namespace 2 send-proxy-v2
= 8< =
A bind line creates the listener in the specified namespace, and connections
originating from that listener also have their network namespace set to
that of the listener.
A server line either forces the connection to be made in a specified
namespace or may use the namespace from the client-side connection if that
was set.
For more documentation please read the documentation included in the patch
itself.
Signed-off-by: KOVACS Tamas <ktamas@balabit.com>
Signed-off-by: Sarkozi Laszlo <laszlo.sarkozi@balabit.com>
Signed-off-by: KOVACS Krisztian <hidden@balabit.com>
When we're stopping, we're not going to create new tasks anymore, so
let's release the task pool upon each task_free() in order to reduce
memory fragmentation.
Commit 179085c ("MEDIUM: http: move Connection header processing earlier")
introduced a regression : the backend's HTTP mode is not considered anymore
when setting the session's HTTP mode, because wait_for_request() is only
called once, when the frontend receives the request (or when the frontend
is in TCP mode, when the backend receives the request).
The net effect is that in some situations when the frontend and the backend
do not work in the same mode (eg: keep-alive vs close), the backend's mode
is ignored.
This patch moves all that processing to a dedicated function, which is
called from the original place, as well as from session_set_backend()
when switching from an HTTP frontend to an HTTP backend in different
modes.
This fix must be backported to 1.5.
There are two sample commands to get information about the presence of a
client certificate.
ssl_fc_has_crt is true if there is a certificate present in the current
connection
ssl_c_used is true if there is a certificate present in the session.
If a session has stopped and resumed, then ssl_c_used could be true, while
ssl_fc_has_crt is false.
In the client byte of the TLS TLV of Proxy Protocol V2, there is only one
bit to indicate whether a certificate is present on the connection. The
attached patch adds a second bit to indicate the presence for the session.
This maintains backward compatibility.
[wt: this should be backported to 1.5 to help maintain compatibility
between versions]
Google's boringssl doesn't currently support OCSP, so
disable it if detected.
OCSP support may be reintroduced as per:
https://code.google.com/p/chromium/issues/detail?id=398677
In that case we can simply revert this commit.
Signed-off-by: Lukas Tribus <luky-37@hotmail.com>
If a source file includes proto/server.h twice or more, redefinition errors will
be triggered for such inline functions as server_throttle_rate(),
server_is_draining(), srv_adm_set_maint() and so on. Just move #endif directive
to the end of file to solve this issue.
Signed-off-by: Godbach <nylzhaowei@gmail.com>
Add support for http-request track-sc, similar to what is done in
tcp-request for backends. A new act_prm field was added to HTTP
request rules to store the track params (table, counter). Just
like for TCP rules, the table is resolved while checking for
config validity. The code was mostly copied from the TCP code
with the exception that here we also count the HTTP request count
and rate by hand. Probably that something could be factored out in
the future.
It seems like tracking flags should be improved to mark each hook
which tracks a key so that we can have some check points where to
increase counters of the past if not done yet, a bit like is done
for TRACK_BACKEND.
Some users want to add their own data types to stick tables. We don't
want to use a linked list here for performance reasons, so we need to
continue to use an indexed array. This patch allows one to reserve a
compile-time-defined number of extra data types by setting the new
macro STKTABLE_EXTRA_DATA_TYPES to anything greater than zero, keeping
in mind that anything larger will slightly inflate the memory consumed
by stick tables (not per entry though).
Then calling stktable_register_data_store() with the new keyword will
either register a new keyword or fail if the desired entry was already
taken or the keyword already registered.
Note that this patch does not dictate how the data will be used, it only
offers the possibility to create new keywords and have an index to
reference them in the config and in the tables. The caller will not be
able to use stktable_data_cast() and will have to explicitly cast the
stable pointers to the expected types. It can be used for experimentation
as well.
Currently we have stktable_fetch_key() which fetches a sample according
to an expression and returns a stick table key, but we also need a function
which does only the second half of it from a known sample. So let's cut the
function in two and introduce smp_to_stkey() to perform this lookup. The
first function was adapted to make use of it in order to avoid code
duplication.
Abstract namespace sockets ignore the shutdown() call and do not make
it possible to temporarily stop listening. The issue it causes is that
during a soft reload, the new process cannot bind, complaining that the
address is already in use.
This change registers a new pause() function for unix sockets and
completely unbinds the abstract ones since it's possible to rebind
them later. It requires the two previous patches as well as preceeding
fixes.
This fix should be backported into 1.5 since the issue apperas there.
In order to fix the abstact socket pause mechanism during soft restarts,
we'll need to proceed differently depending on the socket protocol. The
pause_listener() function already supports some protocol-specific handling
for the TCP case.
This commit makes this cleaner by adding a new ->pause() function to the
protocol struct, which, if defined, may be used to pause a listener of a
given protocol.
For now, only TCP has been adapted, with the specific code moved from
pause_listener() to tcp_pause_listener().
With all the goodies supported by logformat, people find that the limit
of 1024 chars for log lines is too short. Some servers do not support
larger lines and can simply drop them, so changing the default value is
not always the best choice.
This patch takes a different approach. Log line length is specified per
log server on the "log" line, with a value between 80 and 65535. That
way it's possibly to satisfy all needs, even with some fat local servers
and small remote ones.
stktable_fetch_key() does not indicate whether it returns NULL because
the input sample was not found or because it's unstable. It causes trouble
with track-sc* rules. Just like with sample_fetch_string(), we want it to
be able to give more information to the caller about what it found. Thus,
now we use the pointer to a sample passed by the caller, and fill it with
the information we have about the sample. That way, even if we return NULL,
the caller has the ability to check whether a sample was found and if it is
still changing or not.
'ssl_sock_get_common_name' applied to a connection was also renamed
'ssl_sock_get_remote_common_name'. Currently, this function is only used
with protocol PROXYv2 to retrieve the client certificate's common name.
A further usage could be to retrieve the server certificate's common name
on an outgoing connection.
The support is all based on static responses. This doesn't add any
request / response logic to HAProxy, but allows a way to update
information through the socket interface.
Currently certificates specified using "crt" or "crt-list" on "bind" lines
are loaded as PEM files.
For each PEM file, haproxy checks for the presence of file at the same path
suffixed by ".ocsp". If such file is found, support for the TLS Certificate
Status Request extension (also known as "OCSP stapling") is automatically
enabled. The content of this file is optional. If not empty, it must contain
a valid OCSP Response in DER format. In order to be valid an OCSP Response
must comply with the following rules: it has to indicate a good status,
it has to be a single response for the certificate of the PEM file, and it
has to be valid at the moment of addition. If these rules are not respected
the OCSP Response is ignored and a warning is emitted. In order to identify
which certificate an OCSP Response applies to, the issuer's certificate is
necessary. If the issuer's certificate is not found in the PEM file, it will
be loaded from a file at the same path as the PEM file suffixed by ".issuer"
if it exists otherwise it will fail with an error.
It is possible to update an OCSP Response from the unix socket using:
set ssl ocsp-response <response>
This command is used to update an OCSP Response for a certificate (see "crt"
on "bind" lines). Same controls are performed as during the initial loading of
the response. The <response> must be passed as a base64 encoded string of the
DER encoded response from the OCSP server.
Example:
openssl ocsp -issuer issuer.pem -cert server.pem \
-host ocsp.issuer.com:80 -respout resp.der
echo "set ssl ocsp-response $(base64 -w 10000 resp.der)" | \
socat stdio /var/run/haproxy.stat
This feature is automatically enabled on openssl 0.9.8h and above.
This work was performed jointly by Dirkjan Bussink of GitHub and
Emeric Brun of HAProxy Technologies.
This patch adds two new actions to http-request and http-response rulesets :
- replace-header : replace a whole header line, suited for headers
which might contain commas
- replace-value : replace a single header value, suited for headers
defined as lists.
The match consists in a regex, and the replacement string takes a log-format
and supports back-references.
Using the last rate counters, we now compute the queue, connect, response
and total times per server and per backend with a 95% accuracy over the last
1024 samples. The operation is cheap so we don't need to condition it.
While the current functions report average event counts per period, we are
also interested in average values per event. For this we use a different
method. The principle is to rely on a long tail which sums the new value
with a fraction of the previous value, resulting in a sliding window of
infinite length depending on the precision we're interested in.
The idea is that we always keep (N-1)/N of the sum and add the new sampled
value. The sum over N values can be computed with a simple program for a
constant value 1 at each iteration :
N
,---
\ N - 1 e - 1
> ( --------- )^x ~= N * -----
/ N e
'---
x = 1
Note: I'm not sure how to demonstrate this but at least this is easily
verified with a simple program, the sum equals N * 0.632120 for any N
moderately large (tens to hundreds).
Inserting a constant sample value V here simply results in :
sum = V * N * (e - 1) / e
But we don't want to integrate over a small period, but infinitely. Let's
cut the infinity in P periods of N values. Each period M is exactly the same
as period M-1 with a factor of ((N-1)/N)^N applied. A test shows that given a
large N :
N - 1 1
( ------- )^N ~= ---
N e
Our sum is now a sum of each factor times :
N*P P
,--- ,---
\ N - 1 e - 1 \ 1
> v ( --------- )^x ~= VN * ----- * > ---
/ N e / e^x
'--- '---
x = 1 x = 0
For P "large enough", in tests we get this :
P
,---
\ 1 e
> --- ~= -----
/ e^x e - 1
'---
x = 0
This simplifies the sum above :
N*P
,---
\ N - 1
> v ( --------- )^x = VN
/ N
'---
x = 1
So basically by summing values and applying the last result an (N-1)/N factor
we just get N times the values over the long term, so we can recover the
constant value V by dividing by N.
A value added at the entry of the sliding window of N values will thus be
reduced to 1/e or 36.7% after N terms have been added. After a second batch,
it will only be 1/e^2, or 13.5%, and so on. So practically speaking, each
old period of N values represents only a quickly fading ratio of the global
sum :
period ratio
1 36.7%
2 13.5%
3 4.98%
4 1.83%
5 0.67%
6 0.25%
7 0.09%
8 0.033%
9 0.012%
10 0.0045%
So after 10N samples, the initial value has already faded out by a factor of
22026, which is quite fast. If the sliding window is 1024 samples wide, it
means that a sample will only count for 1/22k of its initial value after 10k
samples went after it, which results in half of the value it would represent
using an arithmetic mean. The benefit of this method is that it's very cheap
in terms of computations when N is a power of two. This is very well suited
to record response times as large values will fade out faster than with an
arithmetic mean and will depend on sample count and not time.
Demonstrating all the above assumptions with maths instead of a program is
left as an exercise for the reader.
This patch adds support for a new "drain" mode. So now we have 3 admin
modes for a server :
- READY
- DRAIN
- MAINT
The drain mode disables load balancing but leaves the server up. It can
coexist with maint, except that maint has precedence. It is also inherited
from tracked servers, so just like maint, it's represented with 2 bits.
New functions were designed to set/clear each flag and to propagate the
changes to tracking servers when relevant, and to log the changes. Existing
functions srv_set_adm_maint() and srv_set_adm_ready() were replaced to make
use of the new functions.
Currently the drain mode is not yet used, however the whole logic was tested
with all combinations of set/clear of both flags in various orders to catch
all corner cases.
This function was taken from check_set_server_drain(). It does not
consider health checks at all and only sets a server to stopping
provided it's not in maintenance and is not currently stopped. The
resulting state will be STOPPING. The state change is propagated
to tracked servers.
For now the function is not used, but the goal is to split health
checks status from server status and to be able to change a server's
state regardless of health checks statuses.
This function was taken from check_set_server_up(). It does not consider
health checks at all and only sets a server up provided it's not in
maintenance. The resulting state may be either RUNNING or STARTING
depending on the presence of a slowstart or not. The state change is
propagated to tracked servers.
For now the function is not used, but the goal is to split health
checks status from server status and to be able to change a server's
state regardless of health checks statuses.
This function was extracted from check_set_server_down(). In only
manipulates the server state and does not consider the health checks
at all, nor does it modify their status. It takes a reason message to
report in logs, however it passes NULL when recursing through the
trackers chain.
For now the function is not used, but the goal is to split health
checks status from server status and to be able to change a server's
state regardless of health checks statuses.
srv_adm_append_status() was renamed srv_append_status() since it's no
more dedicated to maintenance mode. It now supports a reason which if
not null is appended to the output string.
We don't have to handle the maintenance transition here anymore so we
can simplify the functions and conditions. This also means that we don't
need the disable/enable functions but only a function to switch to each
new state.
It's worth mentionning that at this stage there are still confusions
between the server state and the checks states. For example, the health
check's state is adjusted from tracked servers changing state, while it
should not be.
This change now involves a new flag SRV_ADMF_IMAINT to note that the
maintenance status of a server is inherited from another server. Thus,
we know at each server level in the chain if it's running, in forced
maintenance or in a maintenance status because it tracks another server,
or even in both states.
Disabling a server propagates this flag down to other servers. Enabling
a server flushes the flag down. A server becomes up again once both of
its flags are cleared.
Two new functions "srv_adm_set_maint()" and "srv_adm_set_ready()" are used to
manipulate this maintenance status. They're used by the CLI and the stats
page.
Now the stats page always says "MAINT" instead of "MAINT(via)" and it's
only the chk/down field which reports "via x/y" when the status is
inherited from another server, but it doesn't say it when a server was
forced into maintenance. The CSV output indicates "MAINT (via x/y)"
instead of only "MAINT(via)". This is the most accurate representation.
One important thing is that now entering/leaving maintenance for a
tracking server correctly follows the state of the tracked server.
Checks.c has become a total mess. A number of proxy or server maintenance
and queue management functions were put there probably because they were
used there, but that makes the code untouchable. And that's without saying
that their names does not always relate to what they really do!
So let's do a first pass by moving these ones :
- set_backend_down() => backend.c
- redistribute_pending() => queue.c:pendconn_redistribute()
- check_for_pending() => queue.c:pendconn_grab_from_px()
- shutdown_sessions => server.c:srv_shutdown_sessions()
- shutdown_backup_sessions => server.c:srv_shutdown_backup_sessions()
All of them were moved at once.
Servers used to have 3 flags to store a state, now they have 4 states
instead. This avoids lots of confusion for the 4 remaining undefined
states.
The encoding from the previous to the new states can be represented
this way :
SRV_STF_RUNNING
| SRV_STF_GOINGDOWN
| | SRV_STF_WARMINGUP
| | |
0 x x SRV_ST_STOPPED
1 0 0 SRV_ST_RUNNING
1 0 1 SRV_ST_STARTING
1 1 x SRV_ST_STOPPING
Note that the case where all bits were set used to exist and was randomly
dealt with. For example, the task was not stopped, the throttle value was
still updated and reported in the stats and in the http_server_state header.
It was the same if the server was stopped by the agent or for maintenance.
It's worth noting that the internal function names are still quite confusing.
Now we introduce srv->admin and srv->prev_admin which are bitfields
containing one bit per source of administrative status (maintenance only
for now). For the sake of backwards compatibility we implement a single
source (ADMF_FMAINT) but the code already checks any source (ADMF_MAINT)
where the STF_MAINTAIN bit was previously checked. This will later allow
us to add ADMF_IMAINT for maintenance mode inherited from tracked servers.
Along doing these changes, it appeared that some places will need to be
revisited when implementing the inherited bit, this concerns all those
modifying the ADMF_FMAINT bit (enable/disable actions on the CLI or stats
page), and the checks to report "via" on the stats page. But currently
the code is harmless.
Till now, the server's state and flags were all saved as a single bit
field. It causes some difficulties because we'd like to have an enum
for the state and separate flags.
This commit starts by splitting them in two distinct fields. The first
one is srv->state (with its counter-part srv->prev_state) which are now
enums, but which still contain bits (SRV_STF_*).
The flags now lie in their own field (srv->flags).
The function srv_is_usable() was updated to use the enum as input, since
it already used to deal only with the state.
Note that currently, the maintenance mode is still in the state for
simplicity, but it must move as well.
When run in daemon mode (i.e. with at least one forked process) and using
the epoll poller, sending USR1 (graceful shutdown) to the worker processes
can cause some workers to start running at 100% CPU. Precondition is having
an established HTTP keep-alive connection when the signal is received.
The cloned (during fork) listening sockets do not get closed in the parent
process, thus they do not get removed from the epoll set automatically
(see man 7 epoll). This can lead to the process receiving epoll events
that it doesn't feel responsible for, resulting in an endless loop around
epoll_wait() delivering these events.
The solution is to explicitly remove these file descriptors from the epoll
set. To not degrade performance, care was taken to only do this when
neccessary, i.e. when the file descriptor was cloned during fork.
Signed-off-by: Conrad Hoffmann <conrad@soundcloud.com>
[wt: a backport to 1.4 could be studied though chances to catch the bug are low]
We used to call srv_is_usable() with either the current state and weights
or the previous ones. This causes trouble for future changes, so let's first
split it in two variants :
- srv_is_usable(srv) considers the current status
- srv_was_usable(srv) considers the previous status
Detecting that a server's status has changed is a bit messy, as well
as it is to commit the status changes. We'll have to add new conditions
soon and we'd better avoid to multiply the number of touched locations
with the high risk of forgetting them.
This commit introduces :
- srv_lb_status_changed() to report if the status changed from the
previously committed one ;
- svr_lb_commit_status() to commit the current status
The function is now used by all load-balancing algorithms.
This flag is only a copy of (srv->uweight == 0), so better get rid of
it to reduce some of the confusion that remains in the code, and use
a simple function to return this state based on this weight instead.
Being able to map prefixes to values is already used for IPv4/IPv6
but was not yet used with strings. It can be very convenient to map
directories to server farms but large lists may be slow.
By using ebmb_insert_prefix() and ebmb_lookup_longest(), we can
insert strings with their own length as a prefix, and lookup
candidate strings and ensure that the longest matching one will
be returned, which is the longest string matching the entry.
This commit modifies the PROXY protocol V2 specification to support headers
longer than 255 bytes allowing for optional extensions. It implements the
PROXY protocol V2 which is a binary representation of V1. This will make
parsing more efficient for clients who will know in advance exactly how
many bytes to read. Also, it defines and implements some optional PROXY
protocol V2 extensions to send information about downstream SSL/TLS
connections. Support for PROXY protocol V1 remains unchanged.
Process shared mutex seems not supported on some OSs (FreeBSD).
This patch checks errors on mutex lock init to fallback
on a private session cache (per process cache) in error cases.
Last fix did address the issue for inlined patterns, but it was not
enough because the flags are lost as well when updating patterns
dynamically over the CLI.
Also if the same file was used once with -i and another time without
-i, their references would have been merged and both would have used
the same matching method.
It's appear that the patterns have two types of flags. The first
ones are relative to the pattern matching, and the second are
relative to the pattern storage. The pattern matching flags are
the same for all the patterns of one expression. Now they are
stored in the expression. The storage flags are information
returned by the pattern mathing function. This information is
relative to each entry and is stored in the "struct pattern".
Now, the expression matching flags are forwarded to the parse
and index functions. These flags are stored during the
configuration parsing, and they are used during the parse and
index actions.
This issue was introduced in dev23 with the major pattern rework,
and is a continuation of commit a631fc8 ("BUG/MAJOR: patterns: -i
and -n are ignored for inlined patterns"). No backport is needed.
These flags are only passed to pattern_read_from_file() which
loads the patterns from a file. The functions used to parse the
patterns from the current line do not provide the means to pass
the pattern flags so they're lost.
This issue was introduced in dev23 with the major pattern rework,
and was reported by Graham Morley. No backport is needed.
Using the previous callback, it's trivial to block the heartbeat attack,
first we control the message length, then we emit an SSL error if it is
out of bounds. A special log is emitted, indicating that a heartbleed
attack was stopped so that they are not confused with other failures.
That way, haproxy can protect itself even when running on an unpatched
SSL stack. Tests performed with openssl-1.0.1c indicate a total success.
Users have seen a huge increase in the rate of SSL handshake failures
starting from 2014/04/08 with the release of the Heartbleed OpenSSL
vulnerability (CVE-2014-0160). Haproxy can detect that a heartbeat
was received in the incoming handshake, and such heartbeats are not
supposed to be common, so let's log a different message when a
handshake error happens after a heartbeat is detected.
This patch only adds the new message and the new code.
The http_(res|req)_keywords_register() functions allow to register
new keywords.
You need to declare a keyword list:
struct http_req_action_kw_list test_kws = {
.scope = "testscope",
.kw = {
{ "test", parse_test },
{ NULL, NULL },
}
};
and a parsing function:
int parse_test(const char **args, int *cur_arg, struct proxy *px, struct http_req_rule *rule, char **err)
{
rule->action = HTTP_REQ_ACT_CUSTOM_STOP;
rule->action_ptr = action_function;
return 0;
}
http_req_keywords_register(&test_kws);
The HTTP_REQ_ACT_CUSTOM_STOP action stops evaluation of rules after
your rule, HTTP_REQ_ACT_CUSTOM_CONT permits the evaluation of rules
after your rule.
Finn Arne Gangstad suggested that we should have the ability to break
keep-alive when the target server has reached its maxconn and that a
number of connections are present in the queue. After some discussion
around his proposed patch, the following solution was suggested : have
a per-proxy setting to fix a limit to the number of queued connections
on a server after which we break keep-alive. This ensures that even in
high latency networks where keep-alive is beneficial, we try to find a
different server.
This patch is partially based on his original proposal and implements
this configurable threshold.
All the code inherited from version 1.1 still holds a lot ot sessions
called "t" because in 1.1 they were tasks. This naming is very annoying
and sometimes even confusing, for example in code involving tables.
Let's get rid of this once for all and before 1.5-final.
Nothing changed beyond just carefully renaming these variables.
This basically reimplements commit f3221f9 ("MEDIUM: stats: add support
for HTTP keep-alive on the stats page") which was reverted by commit
51437d2 after Igor Chan reported a broken stats page caused by the bug
fix by previous commit.
In order to avoid abusively relying on buf->o to guess how many bytes to
rewind during a redispatch, we now clear msg->sov. Thus the meaning of this
field is exactly "how many bytes of headers are left to be forwarded". It
is still possible to rewind because msg->eoh + msg->eol equal that value
before scheduling the forwarding, so we can always subtract them.
http_body_rewind() returns the number of bytes to rewind before buf->p to
find the message's body. It relies on http_hdr_rewind() to find the beginning
and adds msg->eoh + msg->eol which are always safe.
http_data_rewind() does the same to get the beginning of the data, which
differs from above when a chunk is present. It uses the function above and
adds msg->sol.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_uri_rewind() returns the number of bytes to rewind before buf->p to
find the URI. It relies on http_hdr_rewind() to find the beginning and
is just here to simplify operations.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_hdr_rewind() returns the number of bytes to rewind before buf->p to
find the beginning of headers. At the moment it's not exact as it still
relies on buf->o, assuming that no other data from a past message were
pending there, but it's what was done till there.
The purpose is to centralize further ->sov changes aiming at avoiding
to rely on buf->o.
http_body_bytes() returns the number of bytes of the current message body
present in the buffer. It is compatible with being called before and after
the headers are forwarded.
This is done to centralize further ->sov changes.
The function addr_to_stktable_key doesn't consider the expected
type of key. If the stick table key is based on IPv6 addresses
and the input is IPv4, the returned key is IPv4 adddress and his
length is 4 bytes, while is expected 16 bytes key.
This patch considers the expected key and try to convert IPv4 to
IPv6 and IPv6 to IPv4 according with the expected key.
This fixes the bug reported by Apollon Oikonomopoulos.
This bug was introduced somewhere in the 1.5-dev process.
The cfgparse.c file becomes huge, and a large part of it comes from the
server keyword parser. Since the configuration is a bit more modular now,
move this parser to server.c.
This patch also moves the check of the "server" keyword earlier in the
supported keywords list, resulting in a slightly faster config parsing
for configs with large numbers of servers (about 10%).
No functional change was made, only the code was moved.
This function it is used for dynamically update all the patterns
attached to one file. This function is atomic. All parsing or indexation
failures are reported in the haproxy logs.
This patch adds new display type. This display returns allocated string,
when the string is flush into buffers, it is freed. This permit to
return the content of "memprintf(err, ...)" messages.
The pat_ref_add functions has changed to return error.
The acl and map function do the same work with the file parsing. This
patch merge these code in only one.
Note that the function map_read_entries_from_file() in the file "map.c"
is moved to the the function pat_ref_read_from_file_smp() in the file
"pattern.c". The code of this function is not modified, only the the
name and the arguments order has changed.
The find_smp search the smp using the value of the pat_ref_elt pointer.
The pat_find_smp_* are no longer used. The function pattern_find_smp()
known all pattern indexation, and can be found
All the pattern delete function can use her reference to the original
"struct pat_ref_elt" to find the element to be remove. The functions
pat_del_list_str() and pat_del_meth() were deleted because after
applying this modification, they have the same code than pat_del_list_ptr().
The pattern reference are stored with two identifiers: the unique_id and
the reference.
The reference identify a file. Each file with the same name point to the
same reference. We can register many times one file. If the file is
modified, all his dependencies are also modified. The reference can be
used with map or acl.
The unique_id identify inline acl. The unique id is unique for each acl.
You cannot force the same id in the configuration file, because this
repport an error.
The format of the acl and map listing through the "socket" has changed
for displaying these new ids.
This patch extract the expect_type variable from the "struct pattern" to
"struct pattern_head". This variable is set during the declaration of
ACL and MAP. With this change, the function "pat_parse_len()" become
useless and can be replaced by "pat_parse_int()".
Implicit ACLs by default rely on the fetch's output type, so let's simply do
the same for all other ones. It has been verified that they all match.
This patch add the following socket command line options:
show acl [<id>]
clear acl <id>
get acl <id> <pattern>
del acl <id> <pattern>
add acl <id> <pattern>
The system used for maps is backported in the pattern functions.
Some functions needs to change the sample associated to pattern. This
new pointer permit to return the a pointer to the sample pointer. The
caller can use or change the value.
This commit adds a delete function for patterns. It looks up all
instances of the pattern to delete and deletes them all. The fetch
keyword declarations have been extended to point to the appropriate
delete function.
The match function known the format of the pattern. The pattern can be
stored in a list or in a tree. The pattern matching function use itself
the good entry point and indexation type.
Each pattern matching function return the struct pattern that match. If
the flag "fill" is set, the struct pattern is filled, otherwise the
content of this struct must not be used.
With this feature, the general pattern matching function cannot have
exceptions for building the "struct pattern".
Before this commit, the pattern_exec_match() function returns the
associate sample, the associate struct pattern or the associate struct
pattern_tree. This is complex to use, because we can check the type of
information returned.
Now the function return always a "struct pattern". If <fill> is not set,
only the value of the pointer can be used as boolean (NULL or other). If
<fill> is set, you can use the <smp> pointer and the pattern
information.
If information must be duplicated, it is stored in trash buffer.
Otherwise, the pattern can point on existing strings.
The method are actuelly stored using two types. Integer if the method is
known and string if the method is not known. The fetch is declared as
UINT, but in some case it can provides STR.
This patch create new type called METH. This type contain interge for
known method and string for the other methods. It can be used with
automatic converters.
The pattern matching can expect method.
During the free or prune function, http_meth pettern is freed. This
patch initialise the freed pointer to NULL.
The operations applied on types SMP_T_CSTR and SMP_T_STR are the same,
but the check code and the declarations are double, because it must
declare action for SMP_T_C* and SMP_T_*. The declared actions and checks
are the same. this complexify the code. Only the "conv" functions can
change from "C*" to "*"
Now, if a function needs to modify input string, it can call the new
function smp_dup(). This one duplicate data in a trash buffer.
The pattern parse functions put the parsed result in a "struct pattern"
without memory allocation. If the pattern must reference the input data
without changes, the pattern point to the parsed string. If buffers are
needed to store translated data, it use th trash buffer. The indexation
function that allocate the memory later if it is needed.
Before this patch, the indexation function check the declared patttern
matching function and index the data according with this function. This
is not useful to add some indexation mode.
This commit adds dedicated indexation function. Each struct pattern is
associated with one indexation function. This function permit to index
data according with the type of pattern and with the type of match.
This commit separes the "struct list" used for the chain the "struct
pattern" which contain the pattern data. Later, this change will permit
to manipulate lists ans trees with the same "struct pattern".
Each pattern parser take only one string. This change is reported to the
function prototype of the function "pattern_register()". Now, it is
called with just one string and no need to browse the array of args.
After the previous patches, the "pat_parse_strcat()" function disappear,
and the "pat_parse_int()" and "pat_parse_dotted_ver()" functions dont
use anymore the "opaque" argument, and take only one string on his
input.
So, after this patch, each pattern parser no longer use the opaque
variable and take only one string as input. This patch change the
prototype of the pattern parsing functions.
Now, the "char **args" is replaced by a "char *arg", the "int *opaque"
is removed and these functions return 1 in succes case, and 0 if fail.
This patch remove the limit of 32 groups. It also permit to use standard
"pat_parse_str()" function in place of "pat_parse_strcat()". The
"pat_parse_strcat()" is no longer used and its removed. Before this
patch, the groups are stored in a bitfield, now they are stored in a
list of strings. The matching is slower, but the number of groups is
low and generally the list of allowed groups is short.
The fetch function "smp_fetch_http_auth_grp()" used with the name
"http_auth_group" return valid username. It can be used as string for
displaying the username or with the acl "http_auth_group" for checking
the group of the user.
Maybe the names of the ACL and fetch methods are no longer suitable, but
I keep the current names for conserving the compatibility with existing
configurations.
The function "userlist_postinit()" is created from verification code
stored in the big function "check_config_validity()". The code is
adapted to the new authentication storage system and it is moved in the
"src/auth.c" file. This function is used to check the validity of the
users declared in groups and to check the validity of groups declared
on the "user" entries.
This resolve function is executed before the check of all proxy because
many acl needs solved users and groups.
Large configurations can take time to parse when thousands of backends
are in use. Let's store all the proxies in trees.
findproxy_mode() has been modified to use the tree for lookups, which
has divided the parsing time by about 2.5. But many lookups are still
present at many places and need to be dealt with.
We store the time stamp of last read in the channel in order to
be able to measure some bit rate and pause lengths. We only use
16 bits which were unused for this. We don't need more, as it
allows us to measure with a millisecond precision for up to 65s.
These ones are only reset during transfers. There is a low but non-null
risk that a first full read causes the previous value to be reused and
immediately to immediately set the CF_STREAMER flag. The impact is only
to increase earlier than expected the SSL record size and to use splice().
This bug was already present in 1.4, so a backport is possible.
Summary:
Track and report last session time on the stats page for each server
in every backend, as well as the backend.
This attempts to address the requirement in the ROADMAP
- add a last activity date for each server (req/resp) that will be
displayed in the stats. It will be useful with soft stop.
The stats page reports this as time elapsed since last session. This
change does not adequately address the requirement for long running
session (websocket, RDP... etc).
Till now, we had one flag per stick counter to indicate if it was
tracked in a backend or in a frontend. We just had to add another
flag per stick-counter to indicate if it relies on contents or just
connection. These flags are quite painful to maintain and tend to
easily conflict with other flags if their number is changed.
The correct solution consists in moving the flags to the stkctr struct
itself, but currently this struct is made of 2 pointers, so adding a
new entry there to store only two bits will cause at least 16 more bytes
to be eaten per counter due to alignment issues, and we definitely don't
want to waste tens to hundreds of bytes per session just for things that
most users don't use.
Since we only need to store two bits per counter, an intermediate
solution consists in replacing the entry pointer with a composite
value made of the original entry pointer and the two flags in the
2 unused lower bits. If later a need for other flags arises, we'll
have to store them in the struct.
A few inline functions have been added to abstract the retrieval
and assignment of the pointers and flags, resulting in very few
changes. That way there is no more dependence on the number of
stick-counters and their position in the session flags.
One year ago, commit 5d5b5d8 ("MEDIUM: proto_tcp: add support for tracking
L7 information") brought support for tracking L7 information in tcp-request
content rules. Two years earlier, commit 0a4838c ("[MEDIUM] session-counters:
correctly unbind the counters tracked by the backend") used to flush the
backend counters after processing a request.
While that earliest patch was correct at the time, it became wrong after
the second patch was merged. The code does what it says, but the concept
is flawed. "TCP request content" rules are evaluated for each HTTP request
over a single connection. So if such a rule in the frontend decides to
track any L7 information or to track L4 information when an L7 condition
matches, then it is applied to all requests over the same connection even
if they don't match. This means that a rule such as :
tcp-request content track-sc0 src if { path /index.html }
will count one request for index.html, and another one for each of the
objects present on this page that are fetched over the same connection
which sent the initial matching request.
Worse, it is possible to make the code do stupid things by using multiple
counters:
tcp-request content track-sc0 src if { path /foo }
tcp-request content track-sc1 src if { path /bar }
Just sending two requests first, one with /foo, one with /bar, shows
twice the number of requests for all subsequent requests. Just because
both of them persist after the end of the request.
So the decision to flush backend-tracked counters was not the correct
one. In practice, what is important is to flush countent-based rules
since they are the ones evaluated for each request.
Doing so requires new flags in the session however, to keep track of
which stick-counter was tracked by what ruleset. A later change might
make this easier to maintain over time.
This bug is 1.5-specific, no backport to stable is needed.
In addition to previous outputs, we also emit the cumulated number of
connections, the cumulated number of requests, the maximum allowed
SSL connection concurrency, the current number of SSL connections and
the cumulated number of SSL connections. This will help troubleshoot
systems which experience memory shortage due to SSL.
This function is used to compute the new polling state based on
the previous state. All pollers have to do this in their update
loop, so better centralize the logic for it.
Currently, each poll loop handles the polled events the same way,
resulting in a lot of duplicated, complex code. Additionally, epoll
was the only one to handle newly created FDs immediately.
So instead, let's move that code to fd.c in a new function dedicated
to this task : fd_process_polled_events(). All pollers now use this
function.
This is the reimplementation of the "done" action : when we experience
a short read, we're almost certain that we've exhausted the system's
buffers and that we'll meet an EAGAIN if we attempt to read again. If
the FD is not yet polled, the stream interface already takes care of
stopping the speculative read. When the FD is already being polled, we
have two options :
- either we're running from a level-triggered poller, in which case
we'd rather report that we've reached the end so that we don't
speculate over the poller and let it report next time data are
available ;
- or we're running from an edge-triggered poller in which case we
have no choice and have to see the EAGAIN to re-enable events.
At the moment we don't have any edge-triggered poller, so it's desirable
to avoid speculative I/O that we know will fail.
Note that this must not be ported to SSL since SSL hides the real
readiness of the file descriptor.
Thanks to this change, we observe no EAGAIN anymore during keep-alive
transfers, and failed recvfrom() are reduced by half in http-server-close
mode (the client-facing side is always being polled and the second recv
can be avoided). Doing so results in about 5% performance increase in
keep-alive mode. Similarly, we used to have up to about 1.6% of EAGAIN
on accept() (1/maxaccept), and these have completely disappeared under
high loads.
It's easier and safer to rely on conn_xprt_ready() everywhere than to
check the flag itself. It will also simplify adding extra checks later
if needed. Some useless controls for !xprt have been removed, as the
XPRT_READY flag itself guarantees xprt is set.
It's easier and safer to rely on conn_ctrl_ready() everywhere than to
check the flag itself. It will also simplify adding extra checks later
if needed. Some useless controls for !ctrl have been removed, as the
CTRL_READY flag itself guarantees ctrl is set.
We simply remove these functions and replace their calls with the
appropriate ones :
- if we're in the data phase, we can simply report wait on the FD
- if we're in the socket phase, we may also have to signal the
desire to read/write on the socket because it might not be
active yet.
These flags were used to report the readiness of the file descriptor.
Now this readiness is directly checked at the file descriptor itself.
This removes the need for constantly synchronizing updates between the
file descriptor and the connection and ensures that all layers share
the same level of information.
For now, the readiness is updated in conn_{sock,data}_poll_* by directly
touching the file descriptor. This must move to the lower layers instead
so that these functions can disappear as well. In this state, the change
works but is incomplete. It's sensible enough to avoid making it more
complex.
Now the sock/data updates become much simpler because they just have to
enable/disable access to a file descriptor and not to care anymore about
its readiness.
This commit heavily changes the polling system in order to definitely
fix the frequent breakage of SSL which needs to remember the last
EAGAIN before deciding whether to poll or not. Now we have a state per
direction for each FD, as opposed to a previous and current state
previously. An FD can have up to 8 different states for each direction,
each of which being the result of a 3-bit combination. These 3 bits
indicate a wish to access the FD, the readiness of the FD and the
subscription of the FD to the polling system.
This means that it will now be possible to remember the state of a
file descriptor across disable/enable sequences that generally happen
during forwarding, where enabling reading on a previously disabled FD
would result in forgetting the EAGAIN flag it met last time.
Several new state manipulation functions have been introduced or
adapted :
- fd_want_{recv,send} : enable receiving/sending on the FD regardless
of its state (sets the ACTIVE flag) ;
- fd_stop_{recv,send} : stop receiving/sending on the FD regardless
of its state (clears the ACTIVE flag) ;
- fd_cant_{recv,send} : report a failure to receive/send on the FD
corresponding to EAGAIN (clears the READY flag) ;
- fd_may_{recv,send} : report the ability to receive/send on the FD
as reported by poll() (sets the READY flag) ;
Some functions are used to report the current FD status :
- fd_{recv,send}_active
- fd_{recv,send}_ready
- fd_{recv,send}_polled
Some functions were removed :
- fd_ev_clr(), fd_ev_set(), fd_ev_rem(), fd_ev_wai()
The POLLHUP/POLLERR flags are now reported as ready so that the I/O layers
knows it can try to access the file descriptor to get this information.
In order to simplify the conditions to add/remove cache entries, a new
function fd_alloc_or_release_cache_entry() was created to be used from
pollers while scanning for updates.
The following pollers have been updated :
ev_select() : done, built, tested on Linux 3.10
ev_poll() : done, built, tested on Linux 3.10
ev_epoll() : done, built, tested on Linux 3.10 & 3.13
ev_kqueue() : done, built, tested on OpenBSD 5.2
We're completely changing the way FDs will be polled. There will be no
more speculative I/O since we'll know the exact FD state, so these will
only be cached events.
First, let's fix a few field names which become confusing. "spec_e" was
used to store a speculative I/O event state. Now we'll store the whole
R/W states for the FD there. "spec_p" was used to store a speculative
I/O cache position. Now let's clearly call it "cache".
We're completely changing the way FDs will be polled. First, let's fix
a few field names which become confusing. "spec_e" was used to store a
speculative I/O event state. Now we'll store the whole R/W states for
the FD there.
It is quite often that an connection error only reports "socket error" with
no more information. This is especially problematic with health checks where
many causes are possible, including resource exhaustion which do not lead to
a valid errno code. So let's add explicit codes to cover these cases.
Till now there was no way to know from a connection if a previous
call to drain() had done any change. This function is used to drain
incoming data and to update the connection's flags at the same time.
It also correctly sets the polling flags on the connection if the
drain function indicates inability to receive. This function will
be used preferably over ctrl->drain() when a connection is used.
This reverts commit 1208266356.
It randomly breaks SSL. What happens is that if the SSL response is
read at once by the SSL stack and is partially delivered to the buffer,
then there's no way to read the next parts because we wait for some
polling first.
So we'll fix this after the polling rework.
This function is called twice per request, and does almost always nothing.
Better use an inline version to avoid entering it when we can.
About 0.5% additional performance was gained this way.
si_connect() used to only return SI_ST_CON. But it already detect the
connection reuse and is the function which avoids calling connect().
So it already knows the connection is valid and reuse. Thus we make it
return SI_ST_EST when a connection is reused. This means that
connect_server() can return this state and sess_update_stream_int()
as well.
Thanks to this change, we don't need to leave process_session() in
SI_ST_CON state to immediately enter it again to switch to SI_ST_EST.
Implementing this removes one call to process_session() per request
in keep-alive mode. We're now at 2 calls per request, which is the
minimum (one for the request and another one for the response). The
number of calls to http_wait_for_response() has also dropped from 2
to one.
Tests indicate a performance gain of about 2.6% in request rate in
keep-alive mode. There should be no gain in http-server-close() since
we don't use this faster path.
This reverts commit f3221f99ac.
Igor reported some very strange breakage of his stats page which is
clearly caused by the chunking, though I don't see at first glance
what could be wrong. Better revert it for now.
In theory the principle is simple as we just need to send HTTP chunks
if the client is 1.1 compatible. In practice it's harder because we
have to append a CR LF after each block of data and we're never sure
to have the room for this. In order not to have to deal with this, we
instead send the CR LF prior to each chunk size. The only issue is for
the first chunk and for this reason we avoid to send the empty header
line when using chunked encoding.
If a file descriptor is being polled, and it stopped (eg: buffer full
or end of response), then re-enabled, currently what happens is that
the polling is disabled, then the fd is enabled in speculative mode,
an I/O attempt is made, it loses (otherwise the FD would surely not
have been polled), and the polled is enabled again.
This is too bad, especially with HTTP keep-alive on the server side
where all operations are performed at once before going back to the
poll loop.
Now we improve the behaviour by ensuring that if an fd is still being
polled, when it's enabled after having been disabled, we re-enable the
polling. Doing so saves a number of syscalls and useless wakeups, and
results in a significant performance gain on HTTP keep-alive. A 11%
increase has been observed on the HTTP request rate in keep-alive
thanks to this.
It could be considered as a bug fix, but there was no harm with the
current behaviour, except extra syscalls.
Idle connections are not monitored right now. So if a server closes after
a response without advertising it, it won't be detected until a next
request wants to use the connection. This is a bit problematic because
it unnecessarily maintains file descriptors and sockets in an idle
state.
This patch implements a very simple idle connection manager for the stream
interface. It presents itself as an I/O callback. The HTTP engine enables
it when it recycles a connection. If a close or an error is detected on the
underlying socket, it tries to drain as much data as possible from the socket,
detect the close and responds with a close as well, then detaches from the
stream interface.
The throttling of low weight servers (<16) could mistakenly be reported
as > 100% due to a rounding that was performed before a multiply by 100
instead of after. This was introduced in 1.5-dev20 when fixing a previous
reporting issue by commit d32c399 (MINOR: stats: report correct throttling
percentage for servers in slowstart).
It should be backported if the patch above is backported.
This is the best place to reuse a connection. We centralize all
connection requests and we're at the best place to know exactly
what the current state of the underlying connection is. If the
connection is reused, we just enable polling for send() in order
to be able to emit the request.
When allocating a new connection, only the caller knows whether it's
acceptable to reuse the previous one or not. Let's pass this information
to si_alloc_conn() which will do the cleanup if the connection is not
acceptable.
Right now we see many places doing their own setsockopt(SO_LINGER).
Better only do it just before the close() in fd_delete(). For this
we add a new flag on the file descriptor, indicating if it's safe or
not to linger. If not (eg: after a connect()), then the setsockopt()
call is automatically performed before a close().
The flag automatically turns to safe when receiving a read0.
conn_xprt_ready() reports if the transport layer is ready.
conn_ctrl_ready() reports if the control layer is ready.
The stream interface uses si_conn_ready() to report that the
underlying connection is ready. This will be used for connection
reuse in keep-alive mode.
Doing so ensures that we're consistent between all the functions in the whole
chain. This is important so that we can extract the argument parsing from this
function.
This patch adds map manipulation commands to the socket interface.
add map <map> <key> <value>
Add the value <value> in the map <map>, at the entry corresponding to
the key <key>. This command does not verify if the entry already
exists.
clear map <map>
Remove entries from the map <map>
del map <map> <key>
Delete all the map entries corresponding to the <key> value in the map
<map>.
set map <map> <key> <value>
Modify the value corresponding to each key <key> in a map <map>. The
new value is <value>.
show map [<map>]
Dump info about map converters. Without argument, the list of all
available maps are returned. If a <map> is specified, is content is
dumped.
With this patch, patterns can be compiled for two modes :
- match
- lookup
The match mode is used for example in ACLs or maps. The lookup mode
is used to lookup a key for pattern maintenance. For example, looking
up a network is different from looking up one address belonging to
this network.
A special case is made for regex. In lookup mode they return the input
regex string and do not compile the regex.
Now, the pat_parse_*() functions parses the incoming data. The input
"pattern" struct can be preallocated. If the parser needs to add some
buffers, it allocates memory.
The function pattern_register() runs the call to the parser, process
the key indexation and associate the "sample_storage" used by maps.
This patch remove the compatibility check from the input type and the
match method. Now, it checks if a casts from the input type to output
type exists and the pattern_exec_match() function apply casts before
each pattern matching.
This is used later for increasing the compability with incoming
sample types. When multiple compatible types are supported, one
is arbitrarily used (eg: UINT).
This reduces its size which is not reused by anything else. However it
will significantly improve the debugger's output since we'll now get
real state values.
The default case had to be enabled in the parsers because gcc tries
to optimize the switch/case and noticed some values were missing from
the enums and emitted a warning.
From now on, a call to stream_int_register_handler() causes a call
to si_alloc_appctx() and returns an initialized appctx for the
current stream interface. If one was previously allocated, it is
released. If the stream interface was attached to a connection, it
is released as well.
The appctx are allocated from the same pools as the connections, because
they're substantially smaller in size, and we can't have both a connection
and an appctx on an interface at any moment.
In case of memory shortage, the call may return NULL, which is already
handled by all consumers of stream_int_register_handler().
The field appctx was removed from the stream interface since we only
rely on the endpoint now. On 32-bit, the stream_interface size went down
from 108 to 44 bytes. On 64-bit, it went down from 144 to 64 bytes. This
represents a memory saving of 160 bytes per session.
It seems that a later improvement could be to move the call to
stream_int_register_handler() to session.c for most cases.
The task returned by stream_int_register_handler() is never used, however we
always need to access the appctx afterwards. So make it return the appctx
instead. We already plan for it to fail, which is the reason for the addition
of a few tests and the possibility for the HTTP analyser to return a status
code 500.
We're about to remove si->appctx, so first let's replace all occurrences
of its usage with a dynamic extract from si->end. A lot of code was changed
by search-n-replace, but the behaviour was intentionally not altered.
The code surrounding calls to stream_int_register_handler() was slightly
changed since we can only use si->end *after* the registration.
We used to have two very similar functions for sending a PROXY protocol
line header. The reason is that the default one relies on the stream
interface to retrieve the other end's address, while the "local" one
performs a local address lookup and sends that instead (used by health
checks).
Now that the send_proxy_ofs is stored in the connection and not the
stream interface, we can make the local_send_proxy rely on it and
support partial sends. This also simplifies the code by removing the
local_send_proxy function, making health checks use send_proxy_ofs,
resulting in the removal of the CO_FL_LOCAL_SPROXY flag, and the
associated test in the connection handler. The other flag,
CO_FL_SI_SEND_PROXY was renamed without the "SI" part so that it
is clear that it is not dedicated anymore to a usage with a stream
interface.
Till now the send_proxy_ofs field remained in the stream interface,
but since the dynamic allocation of the connection, it makes a lot
of sense to move that into the connection instead of the stream
interface, since it will not be statically allocated for each
session.
Also, it turns out that moving it to the connection fils an alignment
hole on 64 bit architectures so it does not consume more memory, and
removing it from the stream interface was an opportunity to correctly
reorder fields and reduce the stream interface's size from 160 to 144
bytes (-10%). This is 32 bytes saved per session.
The outgoing connection is now allocated dynamically upon the first attempt
to touch the connection's source or destination address. If this allocation
fails, we fail on SN_ERR_RESOURCE.
As we didn't use si->conn anymore, it was removed. The endpoints are released
upon session_free(), on the error path, and upon a new transaction. That way
we are able to carry the existing server's address across retries.
The stream interfaces are not initialized anymore before session_complete(),
so we could even think about allocating them dynamically as well, though
that would not provide much savings.
The session initialization now makes use of conn_new()/conn_free(). This
slightly simplifies the code and makes it more logical. The connection
initialization code is now shorter by about 120 bytes because it's done
at once, allowing the compiler to remove all redundant initializations.
The si_attach_applet() function now takes care of first detaching the
existing endpoint, and it is called from stream_int_register_handler(),
so we can safely remove the calls to si_release_endpoint() in the
application code around this call.
A call to si_detach() was made upon stream_int_unregister_handler() to
ensure we always free the allocated connection if one was allocated in
parallel to setting an applet (eg: detect HTTP proxy while proceeding
with stats maybe).
si_prepare_conn() is not appropriate in our case as it both initializes and
attaches the connection to the stream interface. Due to the asymmetry between
accept() and connect(), it causes some fields such as the control and transport
layers to be reinitialized.
Now that we can separately initialize these fields using conn_prepare(), let's
break this function to only attach the connection to the stream interface.
Also, by analogy, si_prepare_none() was renamed si_detach(), and
si_prepare_applet() was renamed si_attach_applet().
We don't want to assign the control nor transport layers anymore
at the same time as the data layer, because it prevents one from
keeping existing settings when reattaching a connection to an
existing stream interface.
Let's have conn_attach() replace conn_assign() for this purpose.
Thus, conn_prepare() + conn_attach() do exactly the same as the
previous conn_assign().
Now that we can assign conn->xprt regardless of the initialization state,
we can reintroduce conn_prepare() to set only the protocol, the transport
layer and initialize the transport layer's state.
The first function is used to (re)initialize a stream interface and
the second to force it into a known state. These are intended for
cleaning up the stream interface initialization code in session.c
and peers.c and avoiding future issues with missing initializations.
Currently the control and transport layers of a connection are supposed
to be initialized when their respective pointers are not NULL. This will
not work anymore when we plan to reuse connections, because there is an
asymmetry between the accept() side and the connect() side :
- on accept() side, the fd is set first, then the ctrl layer then the
transport layer ; upon error, they must be undone in the reverse order,
then the FD must be closed. The FD must not be deleted if the control
layer was not yet initialized ;
- on the connect() side, the fd is set last and there is no reliable way
to know if it has been initialized or not. In practice it's initialized
to -1 first but this is hackish and supposes that local FDs only will
be used forever. Also, there are even less solutions for keeping trace
of the transport layer's state.
Also it is possible to support delayed close() when something (eg: logs)
tracks some information requiring the transport and/or control layers,
making it even more difficult to clean them.
So the proposed solution is to add two flags to the connection :
- CO_FL_CTRL_READY is set when the control layer is initialized (fd_insert)
and cleared after it's released (fd_delete).
- CO_FL_XPRT_READY is set when the control layer is initialized (xprt->init)
and cleared after it's released (xprt->close).
The functions have been adapted to rely on this and not on the pointers
anymore. conn_xprt_close() was unused and dangerous : it did not close
the control layer (eg: the socket itself) but still marks the transport
layer as closed, preventing any future call to conn_full_close() from
finishing the job.
The problem comes from conn_full_close() in fact. It needs to close the
xprt and ctrl layers independantly. After that we're still having an issue :
we don't know based on ->ctrl alone whether the fd was registered or not.
For this we use the two new flags CO_FL_XPRT_READY and CO_FL_CTRL_READY. We
now rely on this and not on conn->xprt nor conn->ctrl anymore to decide what
remains to be done on the connection.
In order not to miss some flag assignments, we introduce conn_ctrl_init()
to initialize the control layer, register the fd using fd_insert() and set
the flag, and conn_ctrl_close() which unregisters the fd and removes the
flag, but only if the transport layer was closed.
Similarly, at the transport layer, conn_xprt_init() calls ->init and sets
the flag, while conn_xprt_close() checks the flag, calls ->close and clears
the flag, regardless xprt_ctx or xprt_st. This also ensures that the ->init
and the ->close functions are called only once each and in the correct order.
Note that conn_xprt_close() does nothing if the transport layer is still
tracked.
conn_full_close() now simply calls conn_xprt_close() then conn_full_close()
in turn, which do nothing if CO_FL_XPRT_TRACKED is set.
In order to handle the error path, we also provide conn_force_close() which
ignores CO_FL_XPRT_TRACKED and closes the transport and the control layers
in turns. All relevant instances of fd_delete() have been replaced with
conn_force_close(). Now we always know what state the connection is in and
we can expect to split its initialization.
conn_new() will be a more convenient way of allocating and initializing
a connection. It calls pool_alloc2() and conn_init() upon success.
conn_free() is just a pool_free2() but is provided for symmetry with
conn_new().
Everywhere conn_prepare() is used, the call to conn_init() has already
been done. We can now safely replace all instances of conn_prepare()
with conn_assign() which does not reset the transport layer, and remove
conn_prepare().
This function will ease the initialization of new connections as well
as their reuse. It initializes the obj_type and a few fields so that
the connection is fresh again. It leaves the addresses and target
untouched so it is suitable for use across connection retries.
The connection will only remain there as a pre-allocated entity whose
goal is to be placed in ->end when establishing an outgoing connection.
All connection initialization can be made on this connection, but all
information retrieved should be applied to the end point only.
This change is huge because there were many users of si->conn. Now the
only users are those who initialize the new connection. The difficulty
appears in a few places such as backend.c, proto_http.c, peers.c where
si->conn is used to hold the connection's target address before assigning
the connection to the stream interface. This is why we have to keep
si->conn for now. A future improvement might consist in dynamically
allocating the connection when it is needed.
This function makes no sense anymore and will cause trouble to convert
the remains of connection/applet to end points. Let's replace it now
with its contents.
The long-term goal is to have a context for applets as an alternative
to the connection and not as a complement. At the moment, the context
is still stored into the stream interface, and we only put a pointer
to the applet's context in si->end, initialize the context with object
type OBJ_TYPE_APPCTX, and this allows us not to allocate an entry when
deciding to switch to an applet.
A special care is taken to never dereference si->conn anymore when
dealing with an applet. That's why it's important that si->end is
always set to the proper type :
si->end == NULL => not connected to anything
*si->end == OBJ_TYPE_APPCTX => connected to an applet
*si->end == OBJ_TYPE_CONN => real connection (server, proxy, ...)
The session management code used to check the applet from the connection's
target. Now it uses the stream interface's end point and does not touch the
connection at all. Similarly, we stop checking the connection's addresses
and file descriptors when reporting the applet's status in the stats dump.
Since last commit, we now have a pointer to the applet in the
applet context. So we don't need the si->release function pointer
anymore, it can be extracted from applet->applet.release. At many
places, the ->release function was still tested for real connections
while it is only limited to applets, so most of them were simply
removed. For the remaining valid uses, a new inline function
si_applet_release() was added to simplify the check and the call.
In preparation for a later move of all the applet context outside of the
stream interface, we'll need to have access to the applet itself from the
context. Let's have a pointer to it inside the context.
si_prepare_embedded() was used both to attach an applet and to detach
anything from a stream interface. Split it into si_prepare_none() to
detach and si_prepare_applet() to attach an applet.
si->conn->target is now assigned from within these two functions instead
of their respective callers.
The object type was added to "struct appctx". The purpose will be
to identify an appctx when the applet context is detached from the
stream interface. For now, it's still attached, so this patch only
adds the new type and does not replace its use.
Most of the times, the caller of objt_<type>(ptr) will know that <ptr>
is valid and of the correct type (eg: in an "if" condition). Let's provide
an unsafe variant that does not perform the check again for these usages.
The new functions are called "__objt_<type>".
This is to be more consistent with the other functions. The only
reason why these functions used to return a value was to let the
caller adjust polling by itself, but now their only callers were
the si_shutr()/si_shutw() inline functions. Now these functions
do not depend anymore on the connection.
These connection variant of these functions now call
conn_data_stop_recv()/conn_data_stop_send() before returning order
not to require a return code anymore. The applet version does not
need this at all.
These functions induce a lot of ifs everywhere because they consider two
different cases, one which is where the connection exists and has a file
descriptor, and the other one which is the default case where at most an
applet has to be notified.
Let's have them in si_ops and automatically decide which one to use.
The connection shutdown sequence has been slightly simplified, and we
now clear the flags at the end.
Also we remove SHUTR_NOW after a shutw with nolinger, as it's cleaner
not to keep it.
There is a big trouble with the way POST is handled for the admin
stats page. The POST parameters are extracted from some http-request
rules, and if not round they return zero hoping for being called again
when more data passes. This results in the HTTP analyser being called
several times and all the rules prior to the stats being executed
multiple times as well. That includes rewrite rules.
So instead of doing this, we now move all the processing of the stats
into the stats applet.
That way we just set the stats applet in the HTTP analyser when a stats
request is detected, and the applet takes the time it needs to read the
arguments and respond. We could even imagine improving the applet to
support requests larger than a single buffer.
The code was almost only moved and minimally changed. Several new HTTP
states were added to the stats applet to emit headers, redirects and
to read POST. It was necessary to do this because the headers sent
depend on the parsing of the POST request. In the end it's beneficial
because we removed two stream_int_retnclose() calls.
In preparation for moving the POST processing to the applet, we first
add new states to the HTTP I/O handler. Till now st0 was only 0/1 for
start/end. We now replace it with an enum.
We handle "http-request redirect" with a log-format string now, but we
leave "redirect" unaffected.
Note that the control of the special "/" case is move from the runtime
execution to the configuration parsing. If the format rule list is
empty, the build_logline() function does nothing.
We now have the following enums and all related functions return them and
consume them :
enum pat_match_res {
PAT_NOMATCH = 0, /* sample didn't match any pattern */
PAT_MATCH = 3, /* sample matched at least one pattern */
};
enum acl_test_res {
ACL_TEST_FAIL = 0, /* test failed */
ACL_TEST_MISS = 1, /* test may pass with more info */
ACL_TEST_PASS = 3, /* test passed */
};
enum acl_cond_pol {
ACL_COND_NONE, /* no polarity set yet */
ACL_COND_IF, /* positive condition (after 'if') */
ACL_COND_UNLESS, /* negative condition (after 'unless') */
};
It's just in order to avoid doubts when reading some code.
This patch just renames functions, types and enums. No code was changed.
A significant number of files were touched, especially the ACL arrays,
so it is likely that some external patches will not apply anymore.
One important thing is that we had to split ACL_PAT_* into two groups :
- ACL_TEST_{PASS|MISS|FAIL}
- PAT_{MATCH|UNMATCH}
A future patch will enforce enums on all these places to avoid confusion.
This patch just moves code without any change.
The ACL are just the association between sample and pattern. The pattern
contains the match method and the parse method. These two things are
different. This patch cleans the code by splitting it.
This will be used later with maps. Each map will associate an entry with
a sample_storage value.
This patch changes the "parse" prototype and all the parsing methods.
The goal is to associate "struct sample_storage" to each entry of
"struct acl_pattern". Only the "parse" function can add the sample value
into the "struct acl_pattern".
The map feature will need to match acl patterns. This patch extracts
the matching function from the global ACL function "acl_exec_cond".
The code was only moved to its own function, no functional changes were made.
With this split, the pattern indexation can apply to any source. The map
feature needs this functionality because the map cannot be loaded with the
same file format as the ones supported by acl_read_patterns_from_file().
The code was only moved to its own function, no functional changes were made.
If the acl keyword is a "fetch", the dedicated parsing function
"sample_parse_expr()" is used. Otherwise, the acl parsing function
"parse_acl_expr()" is extended to understand the syntax of a series
of converters placed after the "fetch" keyword.
Before this patch, each acl uses a "struct sample_fetch" and executes
it with the "<fetch>->process()" function. Now, the dedicated function
"sample_process()" is called.
These syntax are now avalaible:
acl bad req.hdr(host),lower -m str www
http-request redirect prefix /go-away if bad
acl bad hdr_beg(host),lower www
http-request redirect prefix /go-away if bad
When parsing track-sc* actions in tcp-request rules, we now automatically
compute the track-sc identifier number using %d when displaying an error
message. But the ID has become wrong since we introduced sc0, we continue
to report id+1 in error messages causing some confusion.
No backport is needed.
Add a DRAIN sub-state for a server which
will be shown on the stats page instead of UP if
its effective weight is zero.
Also, log if a server enters or leaves the DRAIN state
as the result of an agent check.
Signed-off-by: Simon Horman <horms@verge.net.au>
The column used to report the throttle percentage when a server is in
slowstart is based on the time only. This is wrong, because server weights
in slowstart are updated at most once a second, so the reported value is
wrong at least fo rone second during each step, which means all the time
when using short delays (< 20s).
The second point is that it's disturbing to see a weight < 100% without
any throttle at the end of the period (during the last second), because
the effective weight has not yet been updated.
Instead, we now compute the exact ratio between eweight and uweight and
report it. It's always accurate and describes the value being used instead
of using only the date.
It can be backported to 1.4 though it's not particularly important.
A crash was reported by Igor at owind when changing a server's weight
on the CLI. Lukas Tribus could reproduce a related bug where setting
a server's weight would result in the new weight being multiplied by
the initial one. The two bugs are the same.
The incorrect weight calculation results in the total farm weight being
larger than what was initially allocated, causing the map index to be out
of bounds on some hashes. It's easy to reproduce using "balance url_param"
with a variable param, or with "balance static-rr".
It appears that the calculation is made at many places and is not always
right and not always wrong the same way. Thus, this patch introduces a
new function "server_recalc_eweight()" which is dedicated to this task
of computing ->eweight from many other elements including uweight and
current time (for slowstart), and all users now switch to use this
function.
The patch is a bit large but the code was not trivially fixable in a way
that could guarantee this situation would not occur anymore. The fix is
much more readable and has been verified to work with all algorithms,
with both consistent and map-based hashes, and even with static-rr.
Slowstart was tested as well, just like enable/disable server.
The same bug is very likely present in 1.4 as well, so the patch will
probably need to be backported eventhough it will not apply as-is.
Thanks to Lukas and Igor for the information they provided to reproduce it.
Paramatise the following functions over the check of a server
* set_server_down
* set_server_up
* srv_getinter
* server_status_printf
* set_server_check_status
* set_server_disabled
* set_server_enabled
Generally the server parameter of these functions has been removed.
Where it is still needed it is obtained using check->server.
This is in preparation for associating a agent check
with a server which runs as well as the server's existing check.
By paramatising these functions they may act on each of the checks
without further significant modification.
Explanation of the SSP_O_HCHK portion of this change:
* Prior to this patch SSP_O_HCHK serves a single purpose which
is to tell server_status_printf() weather it should print
the details of the check of a server or not.
With the paramatisation that this patch adds there are two cases.
1) Printing the details of the check in which case a
valid check parameter is needed.
2) Not printing the details of the check in which case
the contents check parameter are unused.
In case 1) we could pass SSP_O_HCHK and a valid check and;
In case 2) we could pass !SSP_O_HCHK and any value for check
including NULL.
If NULL is used for case 2) then SSP_O_HCHK becomes supurfulous
and as NULL is used for case 2) SSP_O_HCHK has been removed.
Signed-off-by: Simon Horman <horms@verge.net.au>
When a process with large stick tables is replaced by a new one and remains
present until the last connection finishes, it keeps these data in memory
for nothing since they will never be used anymore by incoming connections,
except during syncing with the new process. This is especially problematic
when dealing with long session protocols such as WebSocket as it becomes
possible to stack many processes and eat a lot of memory.
So the idea here is to know if a table still needs to be synced or not,
and to purge all unused entries once the sync is complete. This means that
after a few hundred milliseconds when everything has been synchronized with
the new process, only a few entries will remain allocated (only the ones
held by sessions during the restart) and all the remaining memory will be
freed.
Note that we carefully do that only after the grace period is expired so as
not to impact a possible proxy that needs to accept a few more connections
before leaving.
Doing this required to add a sync counter to the stick tables, to know how
many peer sync sessions are still in progress in order not to flush the entries
until all synchronizations are completed.
In preparation of more flexibility in the stick counters, make their
number configurable. It still defaults to 3 which is the minimum
accepted value. Changing the value alone is not sufficient to get
more counters, some bitfields still need to be updated and the TCP
actions need to be updated as well, but this update tries to be
easier, which is nice for experimentation purposes.
This function is also called directly from backend.c, so let's stop
building fake args to call it as a sample fetch, and have a lower
layer more generic function instead.
We're having a lot of duplicate code just because of minor variants between
fetch functions that could be dealt with if the functions had the pointer to
the original keyword, so let's pass it as the last argument. An earlier
version used to pass a pointer to the sample_fetch element, but this is not
the best solution for two reasons :
- fetch functions will solely rely on the keyword string
- some other smp_fetch_* users do not have the pointer to the original
keyword and were forced to pass NULL.
So finally we're passing a pointer to the keyword as a const char *, which
perfectly fits the original purpose.
Remove event_accept() in include/proto/proto_http.h and use correct function
name in other two files instead of event_accept().
Signed-off-by: Godbach <nylzhaowei@gmail.com>
It was a bit inconsistent to have gpc start at 0 and sc start at 1,
so make sc start at zero like gpc. No previous release was issued
with sc3 anyway, so no existing setup should be affected.
Some actions were clearly missing to process response headers. This
patch adds a new "http-response" ruleset which provides the following
actions :
- allow : stop evaluating http-response rules
- deny : stop and reject the response with a 502
- add-header : add a header in log-format mode
- set-header : set a header in log-format mode
Since commit cfd97c6f was merged into 1.5-dev14 (BUG/MEDIUM: checks:
prevent TIME_WAITs from appearing also on timeouts), some valid health
checks sometimes used to show some TCP resets. For example, this HTTP
health check sent to a local server :
19:55:15.742818 IP 127.0.0.1.16568 > 127.0.0.1.8000: S 3355859679:3355859679(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
19:55:15.742841 IP 127.0.0.1.8000 > 127.0.0.1.16568: S 1060952566:1060952566(0) ack 3355859680 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
19:55:15.742863 IP 127.0.0.1.16568 > 127.0.0.1.8000: . ack 1 win 257
19:55:15.745402 IP 127.0.0.1.16568 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257
19:55:15.745488 IP 127.0.0.1.8000 > 127.0.0.1.16568: FP 1:146(145) ack 23 win 257
19:55:15.747109 IP 127.0.0.1.16568 > 127.0.0.1.8000: R 23:23(0) ack 147 win 257
After some discussion with Chris Huang-Leaver, it appeared clear that
what we want is to only send the RST when we have no other choice, which
means when the server has not closed. So we still keep SYN/SYN-ACK/RST
for pure TCP checks, but don't want to see an RST emitted as above when
the server has already sent the FIN.
The solution against this consists in implementing a "drain" function at
the protocol layer, which, when defined, causes as much as possible of
the input socket buffer to be flushed to make recv() return zero so that
we know that the server's FIN was received and ACKed. On Linux, we can make
use of MSG_TRUNC on TCP sockets, which has the benefit of draining everything
at once without even copying data. On other platforms, we read up to one
buffer of data before the close. If recv() manages to get the final zero,
we don't disable lingering. Same for hard errors. Otherwise we do.
In practice, on HTTP health checks we generally find that the close was
pending and is returned upon first recv() call. The network trace becomes
cleaner :
19:55:23.650621 IP 127.0.0.1.16561 > 127.0.0.1.8000: S 3982804816:3982804816(0) win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
19:55:23.650644 IP 127.0.0.1.8000 > 127.0.0.1.16561: S 4082139313:4082139313(0) ack 3982804817 win 32792 <mss 16396,nop,nop,sackOK,nop,wscale 7>
19:55:23.650666 IP 127.0.0.1.16561 > 127.0.0.1.8000: . ack 1 win 257
19:55:23.651615 IP 127.0.0.1.16561 > 127.0.0.1.8000: P 1:23(22) ack 1 win 257
19:55:23.651696 IP 127.0.0.1.8000 > 127.0.0.1.16561: FP 1:146(145) ack 23 win 257
19:55:23.652628 IP 127.0.0.1.16561 > 127.0.0.1.8000: F 23:23(0) ack 147 win 257
19:55:23.652655 IP 127.0.0.1.8000 > 127.0.0.1.16561: . ack 24 win 257
This change should be backported to 1.4 which is where Chris encountered
this issue. The code is different, so probably the tcp_drain() function
will have to be put in the checks only.
The req.hdr and res.hdr fetch methods do not work well on headers which
are allowed to contain commas, such as User-Agent, Date or Expires.
More specifically, full-length matching is impossible if a comma is
present.
This patch introduces 4 new fetch functions which are designed to work
with these full-length headers :
- req.fhdr, req.fhdr_cnt
- res.fhdr, res.fhdr_cnt
These ones do not stop at commas and permit to return full-length header
values.
Since 1.5-dev12 and commit 3bf1b2b8 (MAJOR: channel: stop relying on
BF_FULL to take action), the HTTP parser switched to channel_full()
instead of BF_FULL to decide whether a buffer had enough room to start
parsing a request or response. The problem is that channel_full()
intentionally ignores outgoing data, so a corner case exists where a
large response might still be left in a response buffer with just a
few bytes left (much less than the reserve), enough to accept a second
response past the last data, but not enough to permit the HTTP processor
to add some headers. Since all the processing relies on this space being
available, we can get some random crashes when clients pipeline requests.
The analysis of a core from haproxy configured with 20480 bytes buffers
shows this : with enough "luck", when sending back the response for the
first request, the client is slow, the TCP window is congested, the socket
buffers are full, and haproxy's buffer fills up. We still have 20230 bytes
of response data in a 20480 response buffer. The second request is sent to
the server which returns 214 bytes which fit in the small 250 bytes left
in this buffer. And the buffer arrangement makes it possible to escape all
the controls in http_wait_for_response() :
|<------ response buffer = 20480 bytes ------>|
[ 2/2 | 3 | 4 | 1/2 ]
^ start of circular buffer
1/2 = beginning of previous response (18240)
2/2 = end of previous response (1990)
3 = current response (214)
4 = free space (36)
- channel_full() returns false (20230 bytes are going to leave)
- the response headers does not wrap at the end of the buffer
- the remaining linear room after the headers is larger than the
reserve, because it's the previous response which wraps :
=> response is processed
Header rewriting causes it to reach 260 bytes, 10 bytes larger than what
the buffer could hold. So all computations during header addition are
wrong and lead to the corruption we've observed.
All the conditions are very hard to meet (which explains why it took
almost one year for this bug to show up) and are almost impossible to
reproduce on purpose on a test platform. But the bug is clearly there.
This issue was reported by Dinko Korunic who kindly devoted a lot of
time to provide countless traces and cores, and to experiment with
troubleshooting patches to knock the bug down. Thanks Dinko!
No backport is needed, but all 1.5-dev versions between dev12 and dev18
included must be upgraded. A workaround consists in setting option
forceclose to prevent pipelined requests from being processed.
By properly affecting the flags and values, it becomes easier to add
more tracked counters, for example for experimentation. It also slightly
reduces the code and the number of tests. No counters were added with
this patch.
This patch adds a "scope" box in the statistics page in order to
display only proxies with a name that contains the requested value.
The scope filter is preserved across all clicks on the page.
While ACL args were resolved after all the config was parsed, it was not the
case with sample fetch args because they're almost everywhere now.
The issue is that ACLs now solely rely on sample fetches, so their args
resolving doesn't work anymore. And many fetches involving a server, a
proxy or a userlist don't work at all.
The real issue is that at the bottom layers we have no information about
proxies, line numbers, even ACLs in order to report understandable errors,
and that at the top layers we have no visibility over the locations where
fetches are referenced (think log node).
After failing multiple unsatisfying solutions attempts, we now have a new
concept of args list. The principle is that every proxy has a list head
which contains a number of indications such as the config keyword, the
context where it's used, the file and line number, etc... and a list of
arguments. This list head is of the same type as the elements, so it
serves as a template for adding new elements. This way, it is filled from
top to bottom by the callers with the information they have (eg: line
numbers, ACL name, ...) and the lower layers just have to duplicate it and
add an element when they face an argument they cannot resolve yet.
Then at the end of the configuration parsing, a loop passes over each
proxy's list and resolves all the args in sequence. And this way there is
all necessary information to report verbose errors.
The first immediate benefit is that for the first time we got very precise
location of issues (arg number in a keyword in its context, ...). Second,
in order to do this we had to parse log-format and unique-id-format a bit
earlier, so that was a great opportunity for doing so when the directives
are encountered (unless it's a default section). This way, the recorded
line numbers for these args are the ones of the place where the log format
is declared, not the end of the file.
Userlists report slightly more information now. They're the only remaining
ones in the ACL resolving function.
The acl_expr struct used to hold a pointer to the ACL keyword. But since
we now have all the relevant pointers, we don't need that anymore, we just
need the pointer to the keyword as a string in order to return warnings
and error messages.
So let's change this in order to remove the dependency on the acl_keyword
struct from acl_expr.
During this change, acl_cond_kw_conflicts() used to return a pointer to an
ACL keyword but had to be changed to return a const char* for the same reason.
The ACLs now use the fetch's ->use and ->val to decide upon compatibility
between the place where they are used and where the information are fetched.
The code is capable of reporting warnings about very fine incompatibilities
between certain fetches and an exact usage location, so it is expected that
some new warnings will be emitted on some existing configurations.
Two degrees of detection are provided :
- detecting ACLs that never match
- detecting keywords that are ignored
All tests show that this seems to work well, though bugs are still possible.
Proxy's acl_requires was a copy of all bits taken from ACLs, but we'll
get rid of ACL flags and only rely on sample fetches soon. The proxy's
acl_requires was only used to allocate an HTTP context when needed, and
was even forced in HTTP mode. So better have a flag which exactly says
what it's supposed to be used for.
ACL fetch functions used to directly reference a fetch function. Now
that all ACL fetches have their sample fetches equivalent, we can make
ACLs reference a sample fetch keyword instead.
In order to simplify the code, a sample keyword name may be NULL if it
is the same as the ACL's, which is the most common case.
A minor change appeared, http_auth always expects one argument though
the ACL allowed it to be missing and reported as such afterwards, so
fix the ACL to match this. This is not really a bug.
The file acl.c is a real mess, it both contains functions to parse and
process ACLs, and some sample extraction functions which act on buffers.
Some other payload analysers were arbitrarily dispatched to proto_tcp.c.
So now we're moving all payload-based fetches and ACLs to payload.c
which is capable of extracting data from buffers and rely on everything
that is protocol-independant. That way we can safely inflate this file
and only use the other ones when some fetches are really specific (eg:
HTTP, SSL, ...).
As a result of this cleanup, the following new sample fetches became
available even if they're not really useful :
always_false, always_true, rep_ssl_hello_type, rdp_cookie_cnt,
req_len, req_ssl_hello_type, req_ssl_sni, req_ssl_ver, wait_end
The function 'acl_fetch_nothing' was wrong and never used anywhere so it
was removed.
The "rdp_cookie" sample fetch used to have a mandatory argument while it
was optional in ACLs, which are supposed to iterate over RDP cookies. So
we're making it optional as a fetch too, and it will return the first one.
If a log-format involves some sample fetches that may not be present at
the logging instant, we can now report a warning.
Note that this is done both for log-format and for add-header and carefully
respects the original fetch keyword's capabilities.
Samples fetches were relying on two flags SMP_CAP_REQ/SMP_CAP_RES to describe
whether they were compatible with requests rules or with response rules. This
was never reliable because we need a finer granularity (eg: an HTTP request
method needs to parse an HTTP request, and is available past this point).
Some fetches are also dependant on the context (eg: "hdr" uses request or
response depending where it's involved, causing some abiguity).
In order to solve this, we need to precisely indicate in fetches what they
use, and their users will have to compare with what they have.
So now we have a bunch of bits indicating where the sample is fetched in the
processing chain, with a few variants indicating for some of them if it is
permanent or volatile (eg: an HTTP status is stored into the transaction so
it is permanent, despite being caught in the response contents).
The fetches also have a second mask indicating their validity domain. This one
is computed from a conversion table at registration time, so there is no need
for doing it by hand. This validity domain consists in a bitmask with one bit
set for each usage point in the processing chain. Some provisions were made
for upcoming controls such as connection-based TCP rules which apply on top of
the connection layer but before instantiating the session.
Then everywhere a fetch is used, the bit for the control point is checked in
the fetch's validity domain, and it becomes possible to finely ensure that a
fetch will work or not.
Note that we need these two separate bitfields because some fetches are usable
both in request and response (eg: "hdr", "payload"). So the keyword will have
a "use" field made of a combination of several SMP_USE_* values, which will be
converted into a wider list of SMP_VAL_* flags.
The knowledge of permanent vs dynamic information has disappeared for now, as
it was never used. Later we'll probably reintroduce it differently when
dealing with variables. Its only use at the moment could have been to avoid
caching a dynamic rate measurement, but nothing is cached as of now.
Break out set weight processing code.
This is in preparation for reusing the code.
Also, remove duplicate check in nested if clauses.
{px->lbprm.algo & BE_LB_PROP_DYN) is checked by
the immediate outer if clause, so there is no need
to check it a second time.
Signed-off-by: Simon Horman <horms@verge.net.au>
Commit 2b0108ad accidently got rid of the ability to emit a "-" for
empty log fields. This can happen for captured request and response
cookies, as well as for fetches. Since we don't want to have this done
for headers however, we set the default log method when parsing the
format. It is still possible to force the desired mode using +M/-M.
We now have http_apply_redirect_rule() which does all the redirect-specific
job instead of having this inside http_process_req_common().
Also one of the benefit gained from uniformizing this code is that both
keep-alive and close response do emit the PR-- flags. The fix for the
flags could probably be backported to 1.4 though it's very minor.
The previous function http_perform_redirect() was becoming confusing
so it was renamed http_perform_server_redirect() since it only applies
to server-based redirection.
At the moment, we need trash chunks almost everywhere and the only
correctly implemented one is in the sample code. Let's move this to
the chunks so that all other places can use this allocator.
Additionally, the get_trash_chunk() function now really returns two
different chunks. Previously it used to always overwrite the same
chunk and point it to a different buffer, which was a bit tricky
because it's not obvious that two consecutive results do alias each
other.
The dumpstats code looks like a spaghetti plate. Several functions are
supposed to be able to do several things but rely on complex states to
dispatch the work to independant functions. Most of the HTML output is
performed within the switch/case statements of the whole state machine.
Let's clean this up by adding new functions to emit the data and have
a few more iterators to avoid relying on so complex states.
The new stats dump sequence looks like this for CLI and for HTTP :
cli_io_handler()
-> stats_dump_sess_to_buffer() // "show sess"
-> stats_dump_errors_to_buffer() // "show errors"
-> stats_dump_raw_info_to_buffer() // "show info"
-> stats_dump_raw_info()
-> stats_dump_raw_stat_to_buffer() // "show stat"
-> stats_dump_csv_header()
-> stats_dump_proxy()
-> stats_dump_px_hdr()
-> stats_dump_fe_stats()
-> stats_dump_li_stats()
-> stats_dump_sv_stats()
-> stats_dump_be_stats()
-> stats_dump_px_end()
http_stats_io_handler()
-> stats_http_redir()
-> stats_dump_http() // also emits the HTTP headers
-> stats_dump_html_head() // emits the HTML headers
-> stats_dump_csv_header() // emits the CSV headers (same as above)
-> stats_dump_http_info() // note: ignores non-HTML output
-> stats_dump_proxy() // same as above
-> stats_dump_http_end() // emits HTML trailer
The log-format parser reached a limit making it hard to add new features.
It also suffers from a weak handling of certain incorrect corner cases,
for example "%{foo}" is emitted as a litteral while syntactically it's an
argument to no variable. Also the argument parser had to redo some of the
job with some cases causing minor memory leaks (eg: ignored args).
This work aims at improving the situation so that slightly better reporting
is possible and that it becomes possible to extend the log format. The code
has a few more states but looks significantly simpler. The parser is now
capable of reporting ignored arguments and truncated lines.
stream_int_chk_rcv_conn() did not clear connection flags before updating them. It
is unsure whether this could have caused the stalled transfers that have been
reported since dev15.
In order to avoid such further issues, we now use a simple inline function to do
all the job.
Looking at the assembly code that updt_fd() and alloc/release_spec_entry
produce in the polling loops, it's clear that gcc has to recompute pointers
several times in a row because of limited spare registers. By better
grouping adjacent structure updates, we improve the code size by around
60 bytes in the fast path on x86.
The stick counters were in two distinct sets of struct members,
causing some code to be duplicated. Now we use an array, which
enables some processing to be performed in loops. This allowed
the code to be shrunk by 700 bytes.
Until now it was only possible to use track-sc1/sc2 with "src" which
is the IPv4 source address. Now we can use track-sc1/sc2 with any fetch
as well as any transformation type. It works just like the "stick"
directive.
Samples are automatically converted to the correct types for the table.
Only "tcp-request content" rules may use L7 information, and such information
must already be present when the tracking is set up. For example it becomes
possible to track the IP address passed in the X-Forwarded-For header.
HTTP request processing now also considers tracking from backend rules
because we want to be able to update the counters even when the request
was already parsed and tracked.
Some more controls need to be performed (eg: samples do not distinguish
between L4 and L6).
Sessions using client certs are huge (more than 1 kB) and do not fit
in session cache, or require a huge cache.
In this new implementation sshcachesize set a number of available blocks
instead a number of available sessions.
Each block is large enough (128 bytes) to store a simple session (without
client certs).
Huge sessions will take multiple blocks depending on client certificate size.
Note: some unused code for session sync with remote peers was temporarily
removed.
When the PROXY protocol header is expected and fails, leading to an
abort of the incoming connection, we now emit a log message. If option
dontlognull is set and it was just a port probe, then nothing is logged.
Since the introduction of SSL, it became quite annoying not to get any useful
info in logs about handshake failures. Let's improve reporting for embryonic
sessions by checking a per-connection error code and reporting it into the logs
if an error happens before the session is completely instanciated.
The "dontlognull" option is supported in that if a connection does not talk
before being aborted, nothing will be emitted.
At the moment, only timeouts are considered for SSL and the PROXY protocol,
but next patches will handle more errors.
Commit 0ffde2cc in 1.5-dev13 tried to always disable polling on file
descriptors when errors were encountered. Unfortunately it did not
always succeed in doing so because it relied on detecting polling
changes to disable it. Let's use a dedicated conn_stop_polling()
function that is inconditionally called upon error instead.
This managed to stop a busy loop observed when a health check makes
use of the send-proxy protocol and fails before the connection can
be established.
Commit 24db47e0 tried to improve support for delayed ACK upon connect
but it was incomplete, because checks with the proxy protocol would
always enable polling for data receive and there was no way of
distinguishing data polling and delayed ack.
So we add a distinct delack flag to the connect() function so that
the caller decides whether or not to use a delayed ack regardless
of pending data (eg: when send-proxy is in use). Doing so covers all
combinations of { (check with data), (sendproxy), (smart-connect) }.
Several places got the connection close sequence wrong because it
was not obvious. In practice we always need the same sequence when
aborting, so let's have a common function for this.
New option 'maxcompcpuusage' in global section.
Sets the maximum CPU usage HAProxy can reach before stopping the
compression for new requests or decreasing the compression level of
current requests. It works like 'maxcomprate' but with the Idle.
This patch makes changes in the http_response_forward_body state
machine. It checks if the compress algorithm had consumed data before
swapping the temporary and the input buffer. So it prevents null sized
zlib chunks.
Instead of storing a couple of (int, ptr) in the struct connection
and the struct session, we use a different method : we only store a
pointer to an integer which is stored inside the target object and
which contains a unique type identifier. That way, the pointer allows
us to retrieve the object type (by dereferencing it) and the object's
address (by computing the displacement in the target structure). The
NULL pointer always corresponds to OBJ_TYPE_NONE.
This reduces the size of the connection and session structs. It also
simplifies target assignment and compare.
In order to improve the generated code, we try to put the obj_type
element at the beginning of all the structs (listener, server, proxy,
si_applet), so that the original and target pointers are always equal.
A lot of code was touched by massive replaces, but the changes are not
that important.
Before connections were introduced, it was possible to connect an
external task to a stream interface. However it was left as an
exercise for the brave implementer to find how that ought to be
done.
The feature was broken since the introduction of connections and
was never fixed since due to lack of users. Better remove this dead
code now.
Hijackers were functions designed to inject data into channels in the
distant past. They became unused around 1.3.16, and since there has
not been any user of this mechanism to date, it's uncertain whether
the mechanism still works (and it's not really useful anymore). So
better remove it as well as the pointer it uses in the channel struct.
si_fd() is not used a lot, and breaks builds on OpenBSD 5.2 which
defines this name for its own purpose. It's easy enough to remove
this one-liner function, so let's do it.
ev_sepoll already provides everything needed to manage FD events
by only manipulating the speculative I/O list. Nothing there is
sepoll-specific so move all this to fd.
At the moment sepoll is not 100% event-driven, because a call to fd_set()
on an event which is already being polled will not change its state.
This causes issues with OpenSSL because if some I/O processing is interrupted
after clearing the I/O event (eg: read all data from a socket, can't put it
all into the buffer), then there is no way to call the SSL_read() again once
the buffer releases some space.
The only real solution is to go 100% event-driven. The principle is to use
the spec list as an event cache and that each time an I/O event is reported
by epoll_wait(), this event is automatically scheduled for addition to the
spec list for future calls until the consumer explicitly asks for polling
or stopping.
Doing this is a bit tricky because sepoll used to provide a substantial
number of optimizations such as event merging. These optimizations have
been maintained : a dedicated update list is affected when events change,
but not the event list, so that updates may cancel themselves without any
side effect such as displacing events. A specific case was considered for
handling newly created FDs as soon as they are detected from within the
poll loop. This ensures that their read or write operation will always be
attempted as soon as possible, thus reducing the number of poll loops and
process_session wakeups. This is especially true for newly accepted fds
which immediately perform their first recv() call.
Two new flags were added to the fdtab[] struct to tag the fact that a file
descriptor already exists in the update list. One flag indicates that a
file descriptor is new and has just been created (fdtab[].new) and the other
one indicates that a file descriptor is already referenced by the update list
(fdtab[].updated). Even if the FD state changes during operations or if the
fd is closed and replaced, it's not an issue because the update flag remains
and is easily spotted during list walks. The flag must absolutely reflect the
presence of the fd in the update list in order to avoid overflowing the update
list with more events than there are distinct fds.
Note that this change also recovers the small performance loss introduced
by its connection counter-part and goes even beyond.
This is the first step of a series of changes aiming at making the
polling totally event-driven. This first change consists in only
remembering at the connection level whether an FD was enabled or not,
regardless of the fact it was being polled or cached. From now on, an
EAGAIN will always be considered as a change so that the pollers are
able to manage a cache and to flush it based on such events. One of
the noticeable effect is that conn_fd_handler() is called once more
per session (6 instead of 5 min) but other update functions are less
called.
Note that the performance loss caused by this change at the moment is
quite significant, around 2.5%, but the change is needed to have SSL
working correctly in all situations, even when data were read from the
socket and stored in the invisible cache, waiting for some room in the
channel's buffer.
Keys are copied from samples to stick_table_key. If a key is larger
than the stick_table_key, we have an overflow. In pratice it does not
happen because it requires :
1) a configuration with tune.bufsize larger than BUFSIZE (common)
2) a stick-table configured with keys strictly larger than buffers
3) extraction of data larger than BUFSIZE (eg: using payload())
Points 2 and 3 don't make any sense for a real world configuration. That
said the issue needs be fixed. The solution consists in allocating it the
same size as the global buffer size, just like the samples. This fixes the
issue.
Sample conversions rely on two alternative buffers which were previously
allocated as static bufs of size BUFSIZE. Now they're initialized to the
global buffer size. It was the same for HTTP authentication. Note that it
seems that none of them was prone to any mistake when dealing with the
buffer size, but better stay on the safe side by maintaining the old
assumption that a trash buffer is always "large enough".
We will need to be able to switch server connections on a session and
to keep idle connections. In order to achieve this, the preliminary
requirement is that the connections can survive the session and be
detached from them.
Right now they're still allocated at exactly the same place, so when
there is a session, there are always 2 connections. We could soon
improve on this by allocating the outgoing connection only during a
connect().
This current patch touches a lot of code and intentionally does not
change any functionnality. Performance tests show no regression (even
a very minor improvement). The doc has not yet been updated.
This commit introduces HTTP compression using the zlib library.
http_response_forward_body has been modified to call the compression
functions.
This feature includes 3 algorithms: identity, gzip and deflate:
* identity: this is mostly for debugging, and it was useful for
developping the compression feature. With Content-Length in input, it
is making each chunk with the data available in the current buffer.
With chunks in input, it is rechunking, the output chunks will be
bigger or smaller depending of the size of the input chunk and the
size of the buffer. Identity does not apply any change on data.
* gzip: same as identity, but applying a gzip compression. The data
are deflated using the Z_NO_FLUSH flag in zlib. When there is no more
data in the input buffer, it flushes the data in the output buffer
(Z_SYNC_FLUSH). At the end of data, when it receives the last chunk in
input, or when there is no more data to read, it writes the end of
data with Z_FINISH and the ending chunk.
* deflate: same as gzip, but with deflate algorithm and zlib format.
Note that this algorithm has ambiguous support on many browsers and
no support at all from recent ones. It is strongly recommended not
to use it for anything else than experimentation.
You can't choose the compression ratio at the moment, it will be set to
Z_BEST_SPEED (1), as tests have shown very little benefit in terms of
compression ration when going above for HTML contents, at the cost of
a massive CPU impact.
Compression will be activated depending of the Accept-Encoding request
header. With identity, it does not take care of that header.
To build HAProxy with zlib support, use USE_ZLIB=1 in the make
parameters.
This work was initially started by David Du Colombier at Exceliance.
Most calls to channel_forward() are performed with short byte counts and
are already optimized in channel_forward() taking just a few instructions.
Thus it's a waste of CPU cycles to call a function for this, let's just
inline the short byte count case and fall back to the common one for
remaining situations.
Doing so has increased the chunked encoding parser's performance by 12% !
ACL and sample fetches use args list and it is really not convenient to
check for null args everywhere. Now for empty args we pass a constant
list of end of lists. It will allow us to remove many useless checks.
This field was used to trace precisely where a session was terminated
but it did not survive code rearchitecture and was not used at all
anymore. Let's get rid of it.
With this commit, we now separate the channel from the buffer. This will
allow us to replace buffers on the fly without touching the channel. Since
nobody is supposed to keep a reference to a buffer anymore, doing so is not
a problem and will also permit some copy-less data manipulation.
Interestingly, these changes have shown a 2% performance increase on some
workloads, probably due to a better cache placement of data.
These two new log-format tags report the SSL protocol version (%sslv) and the
SSL ciphers (%sslc) used for the connection with the client. For instance, to
append these information just after the client's IP/port address information
on an HTTP log line, use the following configuration :
log-format %Ci:%Cp\ %sslv:%sslc\ [%t]\ %ft\ %b/%s\ %Tq/%Tw/%Tc/%Tr/%Tt\ %st\ %B\ %cc\ \ %cs\ %tsc\ %ac/%fc/%bc/%sc/%rc\ %sq/%bq\ %hr\ %hs\ %{+Q}r
It will report a line such as the following one :
Oct 12 20:47:30 haproxy[9643]: 127.0.0.1:43602 TLSv1:AES-SHA [12/Oct/2012:20:47:30.303] stick2~ stick2/s1 7/0/12/0/19 200 145 - - ---- 0/0/0/0/0 0/0 "GET /?t=0 HTTP/1.0"
When we start logging SSL information, we need the SSL struct to be
present even past the conn_xprt_close() call. In order to achieve this,
we should use refcounting on the connection and the transport layer. At
the moment it's not worth using plain refcounting as only the logs require
this, so instead of real refcounting we just use a flag which will be set
by the log subsystem when SSL data need to be logged.
What happens then is that the xprt->close() call is ignored and the
transport layer is closed again during session_free(), after the log
line is emitted.
When calling conn_xprt_close(), we always clear the transport pointer
so that all transport layers leave the connection in the same state after
a close. This will also make it safer and cheaper to call conn_xprt_close()
multiple times if needed.
Just like with the "bind" lines, we'll switch the "server" line
parsing to keyword registration. The code is essentially the same
as for bind keywords, with minor changes such as support for the
default-server keywords and support for variable argument count.
This callback sends a PROXY protocol line on the outgoing connection,
with the local and remote endpoint information. This is used for local
connections (eg: health checks) where the other end needs to have a
valid address and no connection is relayed.
It was previously in frontend.c but there is no reason for this anymore
considering that all the information involved is in the connection itself
only. Theorically this should be in the socket layer but we don't have
this yet.
We absolutely want to disable FD polling after an error is detected,
otherwise the data layer has to do it and it's far from being obvious
at these layers.
The way we did it was a bit tricky in conn_update_*_polling and
conn_*_polling_changes. However it has almost no impact on performance
and code size both for the fast and slow path.
We'll now be able to remove some flag updates in the stream interface.
The connection flags have progressively been added one after the other
and were not very well organized. Some of them are often used together
and a number of operations are performed on the DATA/SOCK ENA/POL flags.
Thus, they have been reorganized so that flags that work together are
close to each other (allows immediate operands on ARM) and that polling
changes can be detected with fewer operations using a simple shift and
xor. The handshakes are now the last ones so that it will be easier to
add new ones after without risking a collision. All activity-related
flags are also grouped together.
The generic data-layer init callback is now used after the transport
layer is complete and before calling the data layer recv/send callbacks.
This allows the session to switch from the embryonic session data layer
to the complete stream interface data layer, by making conn_session_complete()
the data layer's init callback.
It sill looks awkwards that the init() callback must be used opon error,
but except by adding yet another one, it does not seem to be mergeable
into another function (eg: it should probably not be merged with ->wake
to avoid unneeded calls during the handshake, though semantically that
would make sense).
Instead of calling conn_notify_si() from the connection handler, we
now call data->wake(), which will allow us to use a different callback
with health checks.
Note that we still rely on a flag in order to decide whether or not
to call this function. The reason is that with embryonic sessions,
the callback is already initialized to si_conn_cb without the flag,
and we can't call the SI notify function in the leave path before
the stream interface is initialized.
This issue should be addressed by involving a different data_cb for
embryonic sessions and for stream interfaces, that would be changed
during session_complete() for the final data_cb.
Now conn->data will designate the data layer which is the client for
the transport layer. In practice it's the stream interface and will
soon also be the health checks.
While working on the changes required to make the health checks use the
new connections, it started to become obvious that some naming was not
logical at all in the connections. Specifically, it is not logical to
call the "data layer" the layer which is in charge for all the handshake
and which does not yet provide a data layer once established until a
session has allocated all the required buffers.
In fact, it's more a transport layer, which makes much more sense. The
transport layer offers a medium on which data can transit, and it offers
the functions to move these data when the upper layer requests this. And
it is the upper layer which iterates over the transport layer's functions
to move data which should be called the data layer.
The use case where it's obvious is with embryonic sessions : an incoming
SSL connection is accepted. Only the connection is allocated, not the
buffers nor stream interface, etc... The connection handles the SSL
handshake by itself. Once this handshake is complete, we can't use the
data functions because the buffers and stream interface are not there
yet. Hence we have to first call a specific function to complete the
session initialization, after which we'll be able to use the data
functions. This clearly proves that SSL here is only a transport layer
and that the stream interface constitutes the data layer.
A similar change will be performed to rename app_cb => data, but the
two could not be in the same commit for obvious reasons.
It removes dependencies with futex or mutex but ssl performances decrease
using nbproc > 1 because switching process force session renegotiation.
This can be useful on small systems which never intend to run in multi-process
mode.
We don't needa to lock the memory when there is a single process. This can
make a difference on small systems where locking is much more expensive than
just a test.
This will be needed to find the stream interface from the connection
once they're detached, but in the more immediate term, we'll need this
for health checks since they don't use a stream interface.
Unix permissions are per-bind configuration line and not per listener,
so let's concretize this in the way the config is stored. This avoids
some unneeded loops to set permissions on all listeners.
The access level is not part of the unix perms so it has been moved
away. Once we can use str2listener() to set all listener addresses,
we'll have a bind keyword parser for this one.
Navigating through listeners was very inconvenient and error-prone. Not to
mention that listeners were linked in reverse order and reverted afterwards.
In order to definitely get rid of these issues, we now do the following :
- frontends have a dual-linked list of bind_conf
- frontends have a dual-linked list of listeners
- bind_conf have a dual-linked list of listeners
- listeners have a pointer to their bind_conf
This way we can now navigate from anywhere to anywhere and always find the
proper bind_conf for a given listener, as well as find the list of listeners
for a current bind_conf.
When an unknown "bind" keyword is detected, dump the list of all
registered keywords. Unsupported default alternatives are also reported
as "not supported".
Registering new SSL bind keywords was not particularly handy as it required
many #ifdef in cfgparse.c. Now the code has moved to ssl_sock.c which calls
a register function for all the keywords.
Error reporting was also improved by this move, because the called functions
build an error message using memprintf(), which can span multiple lines if
needed, and each of these errors will be displayed indented in the context of
the bind line being processed. This is important when dealing with certificate
directories which can report multiple errors.
With the arrival of SSL, the "bind" keyword has received even more options,
all of which are processed in cfgparse in a cumbersome way. So it's time to
let modules register their own bind options. This is done very similarly to
the ACLs with a small difference in that we make the difference between an
unknown option and a known, unimplemented option.
Some settings need to be merged per-bind config line and are not necessarily
SSL-specific. It becomes quite inconvenient to have this ssl_conf SSL-specific,
so let's replace it with something more generic.
A side effect of this change is that the "ssl" keyword on "bind" lines is now
just a boolean and that "crt" is needed to designate certificate files or
directories.
Note that much refcounting was needed to have the free() work correctly due to
the number of cert aliases which can make a context be shared by multiple names.
SSL config holds many parameters which are per bind line and not per
listener. Let's use a per-bind line config instead of having it
replicated for each listener.
At the moment we only do this for the SSL part but this should probably
evolved to handle more of the configuration and maybe even the state per
bind line.
This SSL session cache was developped at Exceliance and is the same that
was proposed for stunnel and stud. It makes use of a shared memory area
between the processes so that sessions can be handled by any process. It
is only useful when haproxy runs with nbproc > 1, but it does not hurt
performance at all with nbproc = 1. The aim is to totally replace OpenSSL's
internal cache.
The cache is optimized for Linux >= 2.6 and specifically for x86 platforms.
On Linux/x86, it makes use of futexes for inter-process locking, with some
x86 assembly for the locked instructions. On other architectures, GCC
builtins are used instead, which are available starting from gcc 4.1.
On other operating systems, the locks fall back to pthread mutexes so
libpthread is automatically linked. It is not recommended since pthreads
are much slower than futexes. The lib is only linked if SSL is enabled.
CVE-2009-3555 suggests that client-initiated renegociation should be
prevented in the middle of data. The workaround here consists in having
the SSL layer notify our callback about a handshake occurring, which in
turn causes the connection to be marked in the error state if it was
already considered established (which means if a previous handshake was
completed). The result is that the connection with the client is immediately
aborted and any pending data are dropped.
This data layer supports socket-to-buffer and buffer-to-socket operations.
No sock-to-pipe nor pipe-to-sock functions are provided, since splicing does
not provide any benefit with data transformation. At best it could save a
memcpy() and avoid keeping a buffer allocated but that does not seem very
useful.
An init function and a close function are provided because the SSL context
needs to be allocated/freed.
A data-layer shutw() function is also provided because upon successful
shutdown, we want to store the SSL context in the cache in order to reuse
it for future connections and avoid a new key generation.
The handshake function is directly called from the connection handler.
At this point it is not certain whether this will remain this way or
if a new ->handshake callback will be added to the data layer so that
the connection handler doesn't care about SSL.
The sock-to-buf and buf-to-sock functions are all capable of enabling
the SSL handshake at any time. This also implies polling in the opposite
direction to what was expected. The upper layers must take that into
account (it is OK right now with the stream interface).
It appears that fd.h includes a number of unneeded files and was
included from standard.h, and as such served as an intermediary
to provide almost everything to everyone.
By removing its useless includes, a long dependency chain broke
but could easily be fixed.
Polling flags were set for data and sock layer, but while this does make
sense for the ENA flag, it does not for the POL flag which translates the
detection of an EAGAIN condition. So now we remove the {DATA,SOCK}_POL*
flags and instead introduce two new layer-independant flags (WANT_RD and
WANT_WR). These flags are only set when an EAGAIN is encountered so that
polling can be enabled.
In order for these flags to have any meaning they are not persistent and
have to be cleared by the connection handler before calling the I/O and
data callbacks. For this reason, changes detection has been slightly
improved. Instead of comparing the WANT_* flags with CURR_*_POL, we only
check if the ENA status changes, or if the polling appears, since we don't
want to detect the useless poll to ena transition. Tests show that this
has eliminated one useless call to __fd_clr().
Finally the conn_set_polling() function which was becoming complex and
required complex operations from the caller was split in two and replaced
its two only callers (conn_update_data_polling and conn_update_sock_polling).
The two functions are now much smaller due to the less complex conditions.
Note that it would be possible to re-merge them and only pass a mask but
this does not appear much interesting.
The PROXY protocol is now decoded in the connection before other
handshakes. This means that it may be extracted from a TCP stream
before SSL is decoded from this stream.
When an incoming connection request is accepted, a connection
structure is needed to store its state. However we don't want to
fully initialize a session until the data layer is about to be
ready.
As long as the connection is physically stored into the session,
it's not easy to split both allocations.
As such, we only initialize the minimum requirements of a session,
which results in what we call an embryonic session. Then once the
data layer is ready, we can complete the function's initialization.
Doing so avoids buffers allocation and ensures that a session only
sees ready connections.
The frontend's client timeout is used as the handshake timeout. It
is likely that another timeout will be used in the future.
SSL need to initialize the data layer before proceeding with data. At
the moment, this data layer is automatically initialized from itself,
which will not be possible once we extract connection from sessions
since we'll only create the data layer once the handshake is finished.
So let's have the application layer initialize the data layer before
using it.
Make it more obvious that this function does not depend on any knowledge
of the session. This is important to plan for TCP rules that can run on
connection without any initialized session yet.
The last uses of the stream interfaces were in tcp_connect_server() and
could easily and more appropriately be moved to its callers, si_connect()
and connect_server(), making a lot more sense.
Now the function should theorically be usable for health checks.
It also appears more obvious that the file is split into two distinct
parts :
- the protocol layer used at the connection level
- the tcp analysers executing tcp-* rules and their samples/acls.
We need to have the source and destination addresses in the connection.
They were lying in the stream interface so let's move them. The flags
SI_FL_FROM_SET and SI_FL_TO_SET have been moved as well.
It's worth noting that tcp_connect_server() almost does not use the
stream interface anymore except for a few flags.
It has been identified that once we detach the connection from the SI,
it will probably be needed to keep a copy of the server-side addresses
in the SI just for logging purposes. This has not been implemented right
now though.
This is a massive rename of most functions which should make use of the
word "channel" instead of the word "buffer" in their names.
In concerns the following ones (new names) :
unsigned long long channel_forward(struct channel *buf, unsigned long long bytes);
static inline void channel_init(struct channel *buf)
static inline int channel_input_closed(struct channel *buf)
static inline int channel_output_closed(struct channel *buf)
static inline void channel_check_timeouts(struct channel *b)
static inline void channel_erase(struct channel *buf)
static inline void channel_shutr_now(struct channel *buf)
static inline void channel_shutw_now(struct channel *buf)
static inline void channel_abort(struct channel *buf)
static inline void channel_stop_hijacker(struct channel *buf)
static inline void channel_auto_connect(struct channel *buf)
static inline void channel_dont_connect(struct channel *buf)
static inline void channel_auto_close(struct channel *buf)
static inline void channel_dont_close(struct channel *buf)
static inline void channel_auto_read(struct channel *buf)
static inline void channel_dont_read(struct channel *buf)
unsigned long long channel_forward(struct channel *buf, unsigned long long bytes)
Some functions provided by channel.[ch] have kept their "buffer" name because
they are really designed to act on the buffer according to some information
gathered from the channel. They have been moved together to the same place in
the file for better readability but they were not changed at all.
The "buffer" memory pool was also renamed "channel".
Get rid of these confusing BF_* flags. Now channel naming should clearly
be used everywhere appropriate.
No code was changed, only a renaming was performed. The comments about
channel operations was updated.
These functions do not depend on the channel flags anymore thus they're
much better suited to be used on plain buffers. Move them from channel
to buffer.
This is similar to the recent removal of BF_OUT_EMPTY. This flag was very
problematic because it relies on permanently changing information such as the
to_forward value, so it had to be updated upon every change to the buffers.
Previous patch already got rid of its users.
One part of the change is sensible : the flag was also part of BF_MASK_STATIC,
which is used by process_session() to rescan all analysers in case the flag's
status changes. At first glance, none of the analysers seems to change its
mind base on this flag when it is subject to change, so it seems fine not to
add variation checks here. Otherwise it's possible that checking the buffer's
input and output is more reliable than checking the flag's replacement.
This flag was very problematic because it was composite in that both changes
to the pipe or to the buffer had to cause this flag to be updated, which is
not always simple (eg: there may not even be a channel attached to a buffer
at all).
There were not that many users of this flags, mostly setters. So the flag got
replaced with a macro which reports whether the channel is empty or not, by
checking both the pipe and the buffer.
One part of the change is sensible : the flag was also part of BF_MASK_STATIC,
which is used by process_session() to rescan all analysers in case the flag's
status changes. At first glance, none of the analysers seems to change its
mind base on this flag when it is subject to change, so it seems fine not to
add variation checks here. Otherwise it's possible that checking the buffer's
output size is more useful than checking the flag's replacement.
Some parts of the sock_ops structure were only used by the stream
interface and have been moved into si_ops. Some of them were callbacks
to the stream interface from the connection and have been moved into
app_cp as they're the application seen from the connection (later,
health-checks will need to use them). The rest has moved to data_ops.
Normally at this point the connection could live without knowing about
stream interfaces at all.
Similar to what was done on the receive path, the data layer now provides
only an snd_buf() callback that is iterated over by the stream interface's
si_conn_send_loop() function.
The data layer now has no knowledge about channels nor stream interfaces.
The splice() code still need to be ported as it currently is disabled.
The recv function is now generic and is usable to iterate any connection-to-buf
reading function from a stream interface. So let's move it to stream-interface.
The "raw_sock" prefix will be more convenient for naming functions as
it will be prefixed with the data layer and suffixed with the data
direction. So let's rename the files now to avoid any further confusion.
The #include directive was also removed from a number of files which do
not need it anymore.
At the moment, the struct is still embedded into the struct channel, but
all the functions have been updated to use struct buffer only when possible,
otherwise struct channel. Some functions would likely need to be splitted
between a buffer-layer primitive and a channel-layer function.
Later the buffer should become a pointer in the struct buffer, but doing so
requires a few changes to the buffer allocation calls.
This is a massive rename. We'll then split channel and buffer.
This change needs a lot of cleanups. At many locations, the parameter
or variable is still called "buf" which will become ambiguous. Also,
the "struct channel" is still defined in buffers.h.
This function is used by the data layer when a zero has been read over a
connection. At the moment it only handles sockets and nothing else. Once
the complete split is done between buffers and stream interfaces, it should
become possible to work regardless on the connection type.
The connection send() callback is supposed to be generic for a
stream-interface, and consists in calling the lower layer snd_buf
function. Move this function to the stream interface and remove
the sock-raw and sock-ssl clones.
This callback is used to send data from the buffer to the socket. It is
the old write_loop() call of the data layer which is used both by the
->write() callback and the ->chk_snd() function. The reason for having
it as a pointer is that it's the only remaining part which causes the
write and chk_snd() functions to be different between raw and ssl.
sock_raw and sock_ssl use a pretty generic chk_rcv function, so let's move
this function to the stream_interface and remove specific functions. Later
we might have a single chk_rcv function.
We need to have a generic function to be called by upper layers when buffer
flags have been updated (the si->update function). At the moment, both sock_raw
and sock_ssl had their own which basically was a copy-paste. Since these
functions are only used to update stream interface flags, it is logical to
have them handled by the stream interface code.
This allowed us to remove the stream_interface-specific update function from
sock_raw and sock_ssl which now use the generic code.
The stream_sock_update_conn callback has also been more appropriately renamed
conn_notify_si() since it's meant to be called by lower layers to notify the
SI and possibly upper layers about incoming changes.
This is a second attempt at getting rid of FD_WAIT_*. Now the situation is
much better since native I/O handlers can directly manipulate the FD using
fd_{poll|want|stop}_* and the connection handlers manipulate connection-level
flags using the conn_{data|sock}_* equivalent.
Proceeding this way ensures that the connection flags always reflect the
reality even after data<->handshake switches.
Now the connection handler, the handshake callbacks and the I/O callbacks
make use of the connection-layer polling functions to enable or disable
polling on a file descriptor.
Some changes still need to be done to avoid using the FD_WAIT_* constants.
The conflicts we're facing with polling is that handshake handlers have
precedence over data handlers and may change the polling requirements
regardless of what is expected by the data layer. This causes issues
such as missed events.
The real need is to have three polling levels :
- the "current" one, which is effective at any moment
- the data one, which reflects what the data layer asks for
- the sock one, which reflects what the socket layer asks for
Depending on whether a handshake is in progress or not, either one of the
last two will replace the current one, and the change will be propagated
to the lower layers.
At the moment, the shutdown status is not considered, and only handshakes
are used to decide which layer to chose. This will probably change.
The old EV_FD_SET() macro was confusing, as it would enable receipt but there
was no way to indicate that EAGAIN was received, hence the recently added
FD_WAIT_* flags. They're not enough as we're still facing a conflict between
EV_FD_* and FD_WAIT_*. So let's offer I/O functions what they need to explicitly
request polling.
These functions have a more explicity meaning and will offer provisions
for explicit polling.
EV_FD_ISSET() has been left for now as it is still in use in checks.
Up to now, we had to use a shutr/shutw interface per data layer, which
basically means 3 distinct functions when we include SSL :
- generic stream_interface
- sock_raw
- sock_ssl
With this change, the code located in the stream_interface manages all the
stream_interface and buffer updates, and calls the data layer hooks when
needed.
At the moment, the socket layer hook had been implicitly considered as
being a regular socket, so the si_shut*() functions call the normal
shutdown() and EV_FD_CLR() functions on the fd if a socket layer is
defined. This may change in the future. The stream_int_shut*()
functions don't call EV_FD_CLR() so that they can later be embedded
in lower layers.
Thus, the si->data->shutr() is not called anymore and si->data->shutw()
is called to close the data layer only (eg: only for SSL).
Proceeding like this is very important because it's the only way to be
able not to rely on these functions when called from the connection
handlers, and call the data layers' instead.
These primitives were initially introduced so that callers were able to
conditionally set/disable polling on a file descriptor and check in return
what the state was. It's been long since we last had an "if" on this, and
all pollers' functions were the same for cond_* and their systematic
counter parts, except that this required a check and a specific return
value that are not always necessary.
So let's simplify the FD API by removing this now unused distinction and
by making all specific functions return void.
The sock_ops I/O callbacks made use of an FD till now. This has become
inappropriate and the struct connection is much more useful. It also
fixes the race condition introduced by previous change.
The socket data layer code must only focus on moving data between a
socket and a buffer. We need a special stream interface handler to
update the stream interface and the file descriptor status.
At the moment the code works but suffers from a race condition caused
by its API : the read/write callbacks still make use of the fd instead
of using the connection. And when a double shutdown is performed, a call
to ->write() after ->read() processed an error results in dereferencing
a NULL fdtab[]->owner. This is only a temporary issue which doesn't need
to be fixed now since this will automatically go away when the functions
change to use the connection instead.
Use a single tcp_connect_probe() instead of tcp_connect_write() and
tcp_connect_read(). We call this one only when no data layer function
have been processed, so this is a fallback to test for completion of
a connection attempt.
With this done, we don't have the need for any direct I/O callback
anymore.
The function still relies on ->write() to wake the stream interface up,
so it's not finished.
This handshake handler must be independant, so move it away from
proto_tcp. It has a dedicated connection flag. It is tested before
I/O handlers and automatically removes the CO_FL_WAIT_L4_CONN flag
upon success.
It also sets the BF_WRITE_NULL flag on the stream interface and
stops the SI timeout. However it does not perform the task_wakeup(),
and relies on the data handler to do so for now. The SI wakeup will
have to be moved elsewhere anyway.
This is used to enter values for stick tables. The most likely usage
is to set gpc0 for a specific IP address in order to block traffic
for abusers without having to reload. Since all data types are
supported, other usages are possible (eg: replace a users's assigned
server).
Source addresses of non-TCP families were not correctly handled by
tcp_src_to_stktable_key() as it forgot to return NULL and instead left
the previous value in the stick-table buffer.
This bug is 1.5-specific and was introduced by commit 4f92d320 in 1.5-dev6
so it does not need any backport.
Commit 13e66da introduced b_rew() but passes -adv which is an unsigned
quantity on 64-bit platforms, causing the buffer to advance in the wrong
direction.
No backport is needed.
When the target is a client, it will be convenient to have a pointer to the
original listener so that we can retrieve some configuration information at
the stream interface level.
This function will be called later when splitting the shutdown in two
steps. It will be needed by SSL and for remote socket operations to
release unused contexts.
The state and the private pointer are not specific to the applets, since SSL
will require exactly both of them. Move them to the connection layer now and
rename them. We also now ensure that both are NULL on first call.
We start to move everything needed to manage a connection to a special
entity "struct connection". We have the data layer operations and the
control operations there. We'll also have more info in the future such
as file descriptors and applet contexts, so that in the end it becomes
detachable from the stream interface, which will allow connections to
be reused between sessions.
For now on, we start with minimal changes.
Tunnel timeouts are used when TCP connections are forwarded, or
when forwarding upgraded HTTP connections (WebSocket) as well as
CONNECT requests to proxies.
This timeout allows long-lived sessions to be supported without
having to set large timeouts to normal requests.
Similarly to the previous patch, we don't need the socket-layer functions
outside of stream_interface. They could even move to a file dedicated to
applets, though that does not seem particularly useful at the moment.
Commit e164e7a removed get_src/get_dst setting in the stream interfaces but
forgot to set it in proto_tcp. Get the feature back because we need it for
logging, transparent mode, ACLs etc... We now rely on the stream interface
direction to know what syscall to use.
One benefit of doing it this way is that we don't use getsockopt() anymore
on outgoing stream interfaces nor on UNIX sockets.
We'll soon have an SSL socket layer, and in order to ease the difference
between the two, we use the name "sock_raw" to designate the one which
directly talks to the sockets without any conversion.
There is no more reason for the realign function being HTTP specific,
it only operates on a buffer now. Let's move it to buffers.c instead.
It's likely that buffer_bounce_realign is broken (not used), this will
have to be inspected. The function is worth rewriting as it can be
cheaper than buffer_slow_realign() to realign large wrapping buffers.
All keywords registered using a cfg_kw_list now make use of the new error reporting
framework. This allows easier and more precise error reporting without having to
deal with complex buffer allocation issues.
The previous sockstream_accept() function uses nothing from sockstream, and
is totally irrelevant to stream interfaces. Move this to the protocols.c
file which handles listeners and protocols, and call it listener_accept().
It now makes much more sense that the code dealing with listen() also handles
accept() and passes it to upper layers.
This is mainly a massive renaming in the code to get it in line with the
calling convention. Next patch will rename a few files to complete this
operation.
All parsing errors were known but impossible to return. Now by making use
of memprintf(), we're able to build meaningful error messages that the
caller can display.
HTTP header fetch is now done using smp_fetch_hdr() for both ACLs and
patterns. This one also supports an occurrence number, making it possible
to specify explicit occurrences for ACLs and patterns.
Patterns were using a bitmask to indicate if request or response was desired
in fetch functions and keywords. ACLs were using a bitmask in fetch keywords
and a single bit in fetch functions. ACLs were also using an ACL_PARTIAL bit
in fetch functions indicating that a non-final fetch was performed, which was
an abuse of the existing direction flag.
The change now consists in using :
- a capabilities field for fetch keywords => SMP_CAP_REQ/RES to indicate
if a keyword supports requests, responses, both, etc...
- an option field for fetch functions to indicate what the caller expects
(request/response, final/non-final)
The ACL_PARTIAL bit was reversed to get SMP_OPT_FINAL as it's more explicit
to know we're working on a final buffer than on a non-final one.
ACL_DIR_* were removed, as well as PATTERN_FETCH_*. L4 fetches were improved
to support being called on responses too since they're still available.
The <dir> field of all fetch functions was changed to <opt> which is now
unsigned.
The patch is large but mostly made of cosmetic changes to accomodate this, as
almost no logic change happened.
Having the args everywhere will make it easier to share fetch functions
between patterns and ACLs. The only place where we could have needed
the expr was in the http_prefetch function which can do well without.
Previously, both pattern, backend and persist_rdp_cookie would build fake
ACL expressions to fetch an RDP cookie by calling acl_fetch_rdp_cookie().
Now we switch roles. The RDP cookie fetch function is provided as a sample
fetch function that all others rely on, including ACL. The code is exactly
the same, only the args handling moved from expr->args to args. The code
was moved to proto_tcp.c, but probably that a dedicated file would be more
suited to content handling.
This change is pretty minor. Struct pattern is only used for
pattern_process() now so changing it to use the common type is
quite obvious. It's worth noting that the last argument of
pattern_process() is never used so the function is self-sufficient.
Note that pattern_process() does not initialize the pattern at all
before calling fetch->process(), and that minimal initialization
will be required when we later change the argument for the sample.
These ones were either unused or improperly used. Some integers were marked
read-only, which does not make much sense. Buffers are not read-only, they're
"constant" in that they must be kept intact after any possible change.
This one is not needed anymore as we can return the data and its type in the
sample provided by the caller. ACLs now always return the proper type. BOOL
is already returned when the result is expected to be processed as a boolean.
temp_pattern has been unexported now.
This flag was used to force a boolean match even if there was no pattern
to match. It was used only by http_auth() and designed only for this one.
It's easier and cleaner to make the fetch function perform the test and
report the boolean result as a few other functions already do. It simplifies
the acl_exec_cond() logic and will help merging ACLs and patterns.
We don't need the pattern-specific args parsers anymore, make use of the
common parser instead. We still need to improve this by adding a validation
function to report abnormal argument values or combinations. We don't report
precise parsing errors yet but this was not previously done either.
arg_i was almost unused, and since we migrated to use struct arg everywhere,
the rare cases where arg_i was needed could be replaced by switching to
arg->type = ARGT_STOP.
make_arg_list() builds an array of typed arguments with their values,
that the caller describes how to parse. This will be used to support
multiple arguments for ACLs and patterns, which is currently problematic
and prevents ACLs and patterns from being merged. Up to 7 arguments types
may be enumerated in a single 32-bit word, including their number of
mandatory parts.
At the moment, these files are not used yet, they're only built. Note that
the 4-bit encoding for the type has left only one unused type!
This is more convenient and efficient than buf->p = b_ptr(buf, n);
It simply advances the buffer's pointer by <n> and trasfers that
amount of bytes from <in> to <out>. The BF_OUT_EMPTY flag is updated
accordingly.
A few occurrences of such computations in buffers.c and stream_sock.c
were updated to use b_adv(), which resulted in a small code shrink.
buffer_wrap_add was convenient for the migration but is not handy at all.
Let's have new wrappers that report input begin/end and output begin/end
instead.
It looks like we'll also need a b_adv(ofs) to advance a buffer's pointer.
buffer_ignore may only be used when the output of a buffer is empty,
but it's not granted it is always the case when sending HTTP error
responses. Better use buffer_cut_tail() instead, and use buffer_ignore
only on non-wrapping data.
The buffer's pointer <lr> was only used by HTTP parsers which also use a
struct http_msg to keep track of the parser's state. We've reached a point
where it makes no sense to keep ->lr in the buffer, as the split between
buffer and msg is only arbitrary for historical reasons.
This change ensures that touching buffers will not impact HTTP messages
anymore, making the buffers more content-agnostic. However, it becomes
very important not to forget to update msg->next when some data get
forwarded or moved (and in general each time buf->p is updated).
The new pointer in http_msg becomes relative to buffer->p so that
parsing multiple messages becomes easier. It is possible that at one
point ->som and ->next will be merged.
Note: http_parse_reqline() and http_parse_stsline() have been temporarily
modified to know the message starting point in the buffer (->p).
This change gets rid of buf->r which is always equal to buf->p + buf->i.
It removed some wrapping detection at a number of places, but required addition
of new relative offset computations at other locations. A large number of places
can be simplified now with extreme care, since most of the time, either the
pointer has to be computed once or we need a difference between the old ->w and
old ->r to compute free space. The cleanup will probably happen with the rewrite
of the buffer_input_* and buffer_output_* functions anyway.
buf->lr still has to move to the struct http_msg and be relative to buf->p
for the rework to be complete.
This change introduces the buffer's base pointer, which is the limit between
incoming and outgoing data. It's the point where the parsing should start
from. A number of computations have already been greatly simplified, but
more simplifications are expected to come from the removal of buf->r.
The changes appear good and have revealed occasional improper use of some
pointers. It is possible that this patch has introduced bugs or revealed
some, although preliminary testings tend to indicate that everything still
works as it should.
We don't have buf->l anymore. We have buf->i for pending data and
the total length is retrieved by adding buf->o. Some computation
already become simpler.
Despite extreme care, bugs are not excluded.
It's worth noting that msg->err_pos as set by HTTP request/response
analysers becomes relative to pending data and not to the beginning
of the buffer. This has not been completed yet so differences might
occur when outgoing data are left in the buffer.
These callbacks are used to retrieve the source and destination address
of a socket. The address flags are not hold on the stream interface and
not on the session anymore. The addresses are collected when needed.
This still needs to be improved to store the IP and port separately so
that it is not needed to perform a getsockname() when only the IP address
is desired for outgoing traffic.
The Unique ID, is an ID generated with several informations. You can use
a log-format string to customize it, with the "unique-id-format" keyword,
and insert it in the request header, with the "unique-id-header" keyword.
%Fi: Frontend IP
%Fp: Frontend Port
%Si: Server IP
%Sp: Server Port
%Ts: Timestamp
%rt: HTTP request counter
%H: hostname
%pid: PID
+X: Hexadecimal represenation
The +X mode in logformat displays hexadecimal for the following flags
%Ci %Cp %Fi %Fp %Bi %Bp %Si %Sp %Ts %ct %pid
rename logformat_write_string() to lf_text()
Optimize size computation
* logformat functions now take a format linked list as argument
* build_logline() build a logline using a format linked list
* rename LOG_* by LOG_FMT_* in enum
* improve error management in build_logline()
memcmp()/strcmp() calls were needed in different parts of code to determine
the status code. Each new status code introduces new calls, which can become
inefficient and source of bugs.
This patch reorganizes the code to rely on a numeric status code internally
and to be hopefully more generic.
Previously, the stats admin page required POST parameters to be provided
exactly in the same order as the HTML form.
This patch allows to handle those parameters in any orders.
Also, note that haproxy won't alter server states anymore if backend or server
names are ambiguous (duplicated names in the configuration) to prevent
unexpected results (the same should probably be applied to the stats socket).
Commit a1cc38 introduced a regression which was easy to trigger till ad4cd58
(snapshots 20120222 to 20120311 included). The bug was still present after
that but harder to trigger.
The bug is caused by the use of two distinct log buffers due to intermediary
changes. The issue happens when an HTTP request is logged just after a TCP
request during the same second and the HTTP request is too large for the buffer.
In this case, it happens that the HTTP request is logged into the TCP buffer
instead and that length controls can't detect anything.
Starting with bddd4f, the issue is still possible when logging too large an
HTTP request just after a send_log() call (typically a server status change).
We owe a big thanks to Sander Klein for testing several snapshots and more
specifically for taking significant risks in production by letting the buggy
version crash several times in order to provide an exploitable core ! The bug
could not have been found without this precious help. Thank you Sander !
This fix does not need to be backported, it did not affect any released version.
Merge http_sess_log() and tcp_sess_log() to sess_log() and move it to
log.c
A new field in logformat_type define if you can use a logformat
variable in TCP or HTTP mode.
doc: log-format in tcp mode
Note that due to the way log buffer allocation currently works, trying to
log an HTTP request without "option httplog" is still not possible. This
will change in the near future.
A number of offset computation functions use struct buffer* arguments
and return integers without modifying the input. Using consts helps
simplifying some operations in callers.
The principle behind this load balancing algorithm was first imagined
and modeled by Steen Larsen then iteratively refined through several
work sessions until it would totally address its original goal.
The purpose of this algorithm is to always use the smallest number of
servers so that extra servers can be powered off during non-intensive
hours. Additional tools may be used to do that work, possibly by
locally monitoring the servers' activity.
The first server with available connection slots receives the connection.
The servers are choosen from the lowest numeric identifier to the highest
(see server parameter "id"), which defaults to the server's position in
the farm. Once a server reaches its maxconn value, the next server is used.
It does not make sense to use this algorithm without setting maxconn. Note
that it can however make sense to use minconn so that servers are not used
at full load before starting new servers, and so that introduction of new
servers requires a progressively increasing load (the number of servers
would more or less follow the square root of the load until maxconn is
reached). This algorithm ignores the server weight, and is more beneficial
to long sessions such as RDP or IMAP than HTTP, though it can be useful
there too.
http_sess_log now use the logformat linked list to make the log
string, snprintf is not used for speed issue.
CLF mode also uses logformat.
NOTE: as of now, empty fields in CLF now are "" not "-" anymore.
parse_logformat_string: parse the string, detect the type: text,
separator or variable
parse_logformat_var: dectect variable name
parse_logformat_var_args: parse arguments and flags
add_to_logformat_list: add to the logformat linked list
send_log function is now splited in 3 functions
* hdr_log: generate the syslog header
* send_log: send a syslog message with a printf format string
* __send_log: send a syslog message
It was reported that a server configured with a zero weight would
sometimes still take connections from the backend queue. This issue is
real, it happens this way :
1) the disabled server accepts a request with a cookie
2) many cookie-less requests accumulate in the backend queue
3) when the disabled server completes its request, it checks its own
queue and the backend's queue
4) the server takes a pending request from the backend queue and
processes it. In response, the server's cookie is assigned to
the client, which ensures that some requests will continue to
be served by this server, leading back to point 1 above.
The fix consists in preventing a zero-weight server from dequeuing pending
requests from the backend. Making use of srv_is_usable() in such tests makes
the tests more robust against future changes.
This fix must be backported to 1.4 and 1.3.
In a config where server "s1" is marked disabled and "s2" tracks "s1",
s2 appears disabled on the stats but is still inserted into the LB farm
because the tracking is resolved too late in the configuration process.
We now resolve tracked servers before building LB maps and we also mark
the tracking server in maintenance mode, which previously was not done,
causing half of the issue.
Last point is that we also protect srv_is_usable() against electing a
server marked for maintenance. This is not absolutely needed but is a
safe choice and makes a lot of sense.
This fix must be backported to 1.4.
New option "http-send-name-header" specifies the name of a header which
will hold the server name in outgoing requests. This is the name of the
server the connection is really sent to, which means that upon redispatches,
the header's value is updated so that it always matches the server's name.
The new function does not return IP addresses but header values instead,
so that the caller is free to make what it want of them. The conversion
is not quite clean yet, as the previous test which considered that address
0.0.0.0 meant "no address" is still used. A different IP parsing function
should be used to take this into account.
In commit 4b517ca93a (MEDIUM: buffers:
add some new primitives and rework existing ones), we forgot to check
if buffer_max_len() < l.
No backport is needed.
A number of primitives were missing for buffer management, and some
of them were particularly awkward to use. Specifically, the functions
used to compute free space could not always be used depending what was
wrapping in the buffers. Some documentation has been added about how
the buffers work and their properties. Some functions are still missing
such as a buffer replacement which would support wrapping buffers.
It makes no sense to have one pointer to the hdr_idx pool in each proxy
struct since these pools do not depend on the proxy. Let's have a common
pool instead as it is already the case for other types.
Stream interfaces used to distinguish between client and server addresses
because they were previously of different types (sockaddr_storage for the
client, sockaddr_in for the server). This is not the case anymore, and this
distinction is confusing at best and has caused a number of regressions to
be introduced in the process of converting everything to full-ipv6. We can
now remove this and have a much cleaner code.
This patch introduces hdr_len, path_len and url_len for matching these
respective parts lengths against integers. This can be used to detect
abuse or empty headers.
We already had the ability to kill a connection, but it was only
for the checks. Now we can do this for any session, and for this we
add a specific flag "K" to the logs.
The stats socket now allows the admin to disable, enable or shutdown a frontend.
This can be used when a bug is discovered in a configuration and it's desirable
to fix it but the rules in place don't allow to change a running config. Thus it
becomes possible to kill the frontend to release the port and start a new one in
a separate process.
This can also be used to temporarily make haproxy return TCP resets to incoming
requests to pretend the service is not bound. For instance, this may be useful
to quickly flush a very deep SYN backlog.
The frontend check and lookup code was factored with the "set maxconn" usage.
This one enforces a per-process connection rate limit, regardless of what
may be set per frontend. It can be a way to limit the CPU usage of a process
being severely attacked.
The side effect is that the global process connection rate is now measured
for each incoming connection, so it will be possible to report it.
This was revealed with one of the very latest patches which caused
the listener_queue not to be initialized on the stats socket frontend.
And in fact a number of other ones were missing too. This is getting so
boring that now we'll always make use of the same function to initialize
any proxy. Doing so has even saved about 500 bytes on the binary due to
the avoided code redundancy.
No backport is needed.
This function is finally not needed anymore, as it has been replaced with
a per-proxy task that is scheduled when some limits are encountered on
incoming connections or when the process is stopping. The savings should
be noticeable on configs with a large number of proxies. The most important
point is that the rate limiting is now enforced in a clean and solid way.
When a listeners encounters a resource shortage, it currently stops until
one re-enables it. This is far from being perfect as it does not yet handle
the case where the single connection from the listener is rejected (eg: the
stats page).
Now we'll have a special status for resource limited listeners and we'll
queue them into one or multiple lists. That way, each time we have to stop
a listener because of a resource shortage, we can enqueue it and change its
state, so that it is dequeued once more resources are available.
This patch currently does not change any existing behaviour, it only adds
the basic building blocks for doing that.
Managing listeners state is difficult because they have their own state
and can at the same time have theirs dictated by their proxy. The pause
is not done properly, as the proxy code is fiddling with sockets. By
introducing new functions such as pause_listener()/resume_listener(), we
make it a bit more obvious how/when they're supposed to be used. The
listen_proxies() function was also renamed to resume_proxies() since
it's only used for pause/resume.
This patch is the first in a series aiming at getting rid of the maintain_proxies
mess. In the end, proxies should not call enable_listener()/disable_listener()
anymore.
Patch af5149 introduced an issue which can be detected only on out of
memory conditions : a LIST_DEL() may be performed on an uninitialized
struct member instead of a LIST_INIT() during the accept() phase,
causing crashes and memory corruption to occur.
This issue was detected and diagnosed by the Exceliance R&D team.
This is 1.5-specific and very recent, so no existing deployment should
be impacted.
The motivation for this is to allow iteration of all the connections
of a server without the expense of iterating over the global list
of connections.
The first use of this will be to implement an option to close connections
associated with a server when is is marked as being down or in maintenance
mode.
* The declaration of peer_session_create() does
not match its definition. As it is only
used inside of peers.c make it static.
* Make the declaration of peers_register_table()
match its definition.
* Also, make all functions in peers.c that
are not also in peers.h static
Since IPv6 is a different type than IPv4, the pattern fetch functions
src6 and dst6 were added. IPv6 stick-tables can also fetch IPv4 addresses
with src and dst. In this case, the IPv4 addresses are mapped to their
IPv6 counterpart, according to RFC 4291.
Since the latest additions to buffer_forward(), it became too large for
inlining, so let's uninline it. The code size drops by 3kB. Should be
backported to 1.4 too.
Despite much care around handling the content-length as a 64-bit integer,
forwarding was broken on 32-bit platforms due to the 32-bit nature of
the ->to_forward member of the "buffer" struct. The issue is that this
member is declared as a long, so while it works OK on 64-bit platforms,
32-bit truncate the content-length to the lower 32-bits.
One solution could consist in turning to_forward to a long long, but it
is used a lot in the critical path, so it's not acceptable to perform
all buffer size computations on 64-bit there.
The fix consists in changing the to_forward member to a strict 32-bit
integer and ensure in buffer_forward() that only the amount of bytes
that can fit into it is considered. Callers of buffer_forward() are
responsible for checking that their data were taken into account. We
arbitrarily ensure we never consider more than 2G at once.
That's the way it was intended to work on 32-bit platforms except that
it did not.
This issue was tracked down hard at Exosec with Bertrand Jacquin,
Thierry Fournier and Julien Thomas. It remained undetected for a long
time because files larger than 4G are almost always transferred in
chunked-encoded format, and most platforms dealing with huge contents
these days run on 64-bit.
The bug affects all 1.5 and 1.4 versions, and must be backported.
And also rename "req_acl_rule" "http_req_rule". At the beginning that
was a bit confusing to me, especially the "req_acl" list which in fact
holds what we call rules. After some digging, it appeared that some
part of the code is 100% HTTP and not just related to authentication
anymore, so let's move that part to HTTP and keep the auth-only code
in auth.c.
It's very annoying that frontend and backend stats are merged because we
don't know what we're observing. For instance, if a "listen" instance
makes use of a distinct backend, it's impossible to know what the bytes_out
means.
Some points take care of not updating counters twice if the backend points
to the frontend, indicating a "listen" instance. The thing becomes more
complex when we try to add support for server side keep-alive, because we
have to maintain a pointer to the backend used for last request, and to
update its stats. But we can't perform such comparisons anymore because
the counters will not match anymore.
So in order to get rid of this situation, let's have both frontend AND
backend stats in the "struct proxy". We simply update the relevant ones
during activity. Some of them are only accounted for in the backend,
while others are just for frontend. Maybe we can improve a bit on that
later, but the essential part is that those counters now reflect what
they really mean.
This patch turns internal server addresses to sockaddr_storage to
store IPv6 addresses, and makes the connect() function use it. This
code already works but some caveats with getaddrinfo/gethostbyname
still need to be sorted out while the changes had to be merged at
this stage of internal architecture changes. So for now the config
parser will not emit an IPv6 address yet so that user experience
remains unchanged.
This change should have absolutely zero user-visible effect, otherwise
it's a bug introduced during the merge, that should be reported ASAP.
This one has been removed and is now totally superseded by ->target.
To get the server, one must use target_srv(&s->target) instead of
s->srv now.
The function ensures that non-server targets still return NULL.
This is in fact where those parts belong to. The old data_state was replaced
by applet.state and is now initialized when the applet is registered. It's
worth noting that the applet does not need to know the session nor the
buffer anymore since everything is brought by the stream interface.
It is possible that having a separate applet struct would simplify the
code but that's not a big deal.
When doing a connect() on a stream interface, some information is needed
from the server and from the backend. In some situations, we don't have
a server and only a backend (eg: peers). In other cases, we know we have
an applet and we don't want to connect to anything, but we'd still like
to have the info about the applet being used.
For this, we now store a pointer to the "target" into the stream interface.
The target describes what's on the other side before trying to connect. It
can be a server, a proxy or an applet for now. Later we'll probably have
descriptors for multiple-stage chains so that the final information may
still be found.
This will help removing many specific cases in the code. It already made
it possible to remove the "srv" and "be" parameters to tcpv4_connect_server().
I/O handlers are still delicate to manipulate. They have no type, they're
just raw functions which have no knowledge of themselves. Let's have them
declared as applets once for all. That way we can have multiple applets
share the same handler functions and we can store their names there. When
we later need to add more parameters (eg: usage stats), we'll be able to
do so in the applets themselves.
The CLI functions has been prefixed with "cli" instead of "stats" as it's
clearly what is going on there.
The applet descriptor in the stream interface should get all the applet
specific data (st0, ...) but this will be done in the next patch so that
we don't pollute this one too much.
Debugging parsing errors can be greatly improved if we know what the parser
state was and what the buffer flags were (especially for closed inputs/outputs
and full buffers). Let's add that to the error snapshots.
Enhance pattern convs and fetch argument parsing, now fetchs and convs callbacks used typed args.
Add more details on error messages on parsing pattern expression function.
Update existing pattern convs and fetchs to new proto.
Create stick table key type "binary".
Manage Truncation and padding if pattern's fetch-converted result don't match table key size.
The introduction of a new PROXY protocol for proxied connections requires
an early analyser to decode the incoming connection and set the session
flags accordingly.
Some more work is needed, among which setting a flag on the session to
indicate it's proxied, and copying the original parameters for later
comparisons with new ACLs (eg: real_src, ...).
The stats web interface must be read-only by default to prevent security
holes. As it is now allowed to enable/disable servers, a new keyword
"stats admin" is introduced to activate this admin level, conditioned by ACLs.
(cherry picked from commit 5334bab92ca7debe36df69983c19c21b6dc63f78)
Based on a patch provided by Judd Montgomery, it is now possible to
enable/disable servers from the stats web interface. This allows to select
several servers in a backend and apply the action to them at the same time.
Currently, there are 2 known limitations :
- The POST data are limited to one packet
(don't alter too many servers at a time).
- Expect: 100-continue is not supported.
(cherry picked from commit 7693948766cb5647ac03b48e782cfee2b1f14491)
There was no consistency between all the functions used to exchange data
between a buffer and a stream interface. Also, the functions used to send
data to a buffer did not consider the possibility that the buffer was
shutdown for read.
Now the functions are called buffer_{put,get}_{char,block,chunk,string}.
The old buffer_feed* functions have been left available for existing code
but marked deprecated.
Signal zero is never delivered by the system. However having a signal to
which functions and tasks can subscribe to be notified of a stopping event
is useful. So this patch does two things :
1) allow signal zero to be delivered from any function of signal handler
2) make soft_stop() deliver this signal so that tasks can be notified of
a stopping condition.
The two new functions below make it possible to register any number
of functions or tasks to a system signal. They will be called in the
registration order when the signal is received.
struct sig_handler *signal_register_fct(int sig, void (*fct)(struct sig_handler *), int arg);
struct sig_handler *signal_register_task(int sig, struct task *task, int reason);
In case of HTTP keepalive processing, we want to release the counters tracked
by the backend. Till now only the second set of counters was released, while
it could have been assigned by the frontend, or the backend could also have
assigned the first set. Now we reuse to unused bits of the session flags to
mark which stick counters were assigned by the backend and to release them as
appropriate.
The assumption that there was a 1:1 relation between tracked counters and
the frontend/backend role was wrong. It is perfectly possible to track the
track-fe-counters from the backend and the track-be-counters from the
frontend. Thus, in order to reduce confusion, let's remove this useless
{fe,be} reference and simply use {1,2} instead. The keywords have also been
renamed in order to limit confusion. The ACL rule action now becomes
"track-sc{1,2}". The ACLs are now "sc{1,2}_*" instead of "trk{fe,be}_*".
That means that we can reasonably document "sc1" and "sc2" (sticky counters
1 and 2) as sort of patterns that are available during the whole session's
life and use them just like any other pattern.
Having a single tracking pointer for both frontend and backend counters
does not work. Instead let's have one for each. The keyword has changed
to "track-be-counters" and "track-fe-counters", and the ACL "trk_*"
changed to "trkfe_*" and "trkbe_*".
It's a bit cumbersome to have to know all possible storable types
from the stats interface. Instead, let's have generic types for
all data, which will facilitate their manipulation.
It is now possible to dump a table's contents with keys, expire,
use count, and various data using the command above on the stats
socket.
"show table" only shows main table stats, while "show table <name>"
dumps table contents, only if the socket level is admin.
This patch adds support for the following session counters :
- http_req_cnt : HTTP request count
- http_req_rate: HTTP request rate
- http_err_cnt : HTTP request error count
- http_err_rate: HTTP request error rate
The equivalent ACLs have been added to check the tracked counters
for the current session or the counters of the current source.
This function looks up a key, updates its expiration date, or creates
it if it was not found. acl_fetch_src_updt_conn_cnt() was updated to
make use of it.
These counters maintain incoming connection rates and session rates
in a stick-table, over a period which is defined in the configuration
(2 ms to 24 days). They can be used to detect service abuse and
enforce a certain accept rate per source address for instance, and
block if the rate is passed over.
Example :
# block if more than 50 requests per 5 seconds from a source.
stick-table type ip size 200k expire 1m store conn_rate(5s),sess_rate(5s)
tcp-request track-counters src
tcp-request reject if { trk_conn_rate gt 50 }
# cause a 3 seconds pause to requests from sources in excess of 20 requests/5s
tcp-request inspect-delay 3s
tcp-request content accept if { trk_sess_rate gt 20 } WAIT_END
We're now able to return errors based on the validity of an argument
passed to a stick-table store data type. We also support ARG_T_DELAY
to pass delays to stored data types (eg: for rate counters).
Some data types will require arguments (eg: period for a rate counter).
This patch adds support for such arguments between parenthesis in the
"store" directive of the stick-table statement. Right now only integers
are supported.
When a session tracks a counter, automatically increase the cumulated
connection count. This makes src_updt_conn_cnt() almost useless. In
fact it might still be used to update different tables.
The new "conn_cur" session counter has been added. It is automatically
updated upon "track XXX" directives, and the entry is touched at the
moment we increment the value so that we don't consider further counter
updates as real updates, otherwise we would end up updating upon completion,
which may not be desired. Probably that some other event counters (eg: HTTP
requests) will have to be updated upon each event though.
This counter can be matched against current session's source address using
the "src_conn_cur" ACL.
This patch adds the ability to set a pointer in the session to an
entry in a stick table which holds various counters related to a
specific pattern.
Right now the syntax matches the target syntax and only the "src"
pattern can be specified, to track counters related to the session's
IPv4 source address. There is a special function to extract it and
convert it to a key. But the goal is to be able to later support as
many patterns as for the stick rules, and get rid of the specific
function.
The "track-counters" directive may only be set in a "tcp-request"
statement right now. Only the first one applies. Probably that later
we'll support multi-criteria tracking for a single session and that
we'll have to name tracking pointers.
No counter is updated right now, only the refcount is. Some subsequent
patches will have to bring that feature.
The buffer_feed* functions that are used to send data to buffers did only
support sending contiguous chunks while they're relying on memcpy(). This
patch improves on this by making them able to write in two chunks if needed.
Thus, the buffer_almost_full() function has been improved to really consider
the remaining space and not just what can be written at once.
Some freq counters will have to work on periods different from 1 second.
The original freq counters rely on the period to be exactly one second.
The new ones (freq_ctr_period) let the user define the period in ticks,
and all computations are operated over that period. When reading a value,
it indicates the amount of events over that period too.
When an entry already exists, we just need to update its expiration
timer. Let's have a dedicated function for that instead of spreading
open code everywhere.
This change also ensures that an update of an existing sticky session
really leads to an update of its expiration timer, which was apparently
not the case till now. This point needs to be checked in 1.4.
Till now sticky sessions only held server IDs. Now there are other
data types so it is not acceptable anymore to overwrite the server ID
when writing something. The server ID must then only be written from
the caller when appropriate. Doing this has also led to separate
lookup and storage.
The stick_tables will now be able to store extra data for a same key.
A limited set of extra data types will be defined and for each of them
an offset in the sticky session will be assigned at startup time. All
of this information will be stored in the stick table.
The extra data types will have to be specified after the new "store"
keyword of the "stick-table" directive, which will reserve some space
for them.
pattern.c depended on stick_table while in fact it should be the opposite.
So we move from pattern.c everything related to stick_tables and invert the
dependency. That way the code becomes more logical and intuitive.
Right now we're only able to store a server ID in a sticky session.
The goal is to be able to store anything whose size is known at startup
time. For this, we store the extra data before the stksess pointer,
using a negative offset. It will then be easy to cumulate multiple
data provided they each have their own offset.
Now we're able to reject connections very early, so we need to use a
different counter for the connections that are received and the ones
that are accepted and converted into sessions, so that the rate limits
can still apply to the accepted ones. The session rate must still be
used to compute the rate limit, so that we can reject undesired traffic
without affecting the rate.
A new function session_accept() is now called from the lower layer to
instanciate a new session. Once the session is instanciated, the upper
layer's frontent_accept() is called. This one can be service-dependant.
That way, we have a 3-phase accept() sequence :
1) protocol-specific, session-less accept(), which is pointed to by
the listener. It defaults to the generic stream_sock_accept().
2) session_accept() which relies on a frontend but not necessarily
for use in a proxy (eg: stats or any future service).
3) frontend_accept() which performs the accept for the service
offerred by the frontend. It defaults to frontend_accept() which
is really what is used by a proxy.
The TCP/HTTP proxies have been moved to this mode so that we can now rely on
frontend_accept() for any type of session initialization relying on a frontend.
The next step will be to convert the stats to use the same system for the stats.
It's not normal to initialize the server-side stream interface from the
accept() function, because it may change later. Thus, we introduce a new
stream_sock_prepare_interface() function which is called just before the
connect() and which sets all of the stream_interface's callbacks to the
default ones used for real sockets. The ->connect function is also set
at the same instant so that we can easily add new server-side protocols
soon.
The new LI_O_TCP_RULES listener option indicates that some TCP rules
must be checked upon accept on this listener. It is now checked by
the frontend and the L4 rules are evaluated only in this case. The
flag is only set when at least one tcp-req rule is present in the
frontend.
The L4 rules check function has now been moved to proto_tcp.c where
it ought to be.
For a long time we had two large accept() functions, one for TCP
sockets instanciating proxies, and another one for UNIX sockets
instanciating the stats interface.
A lot of code was duplicated and both did not work exactly the same way.
Now we have a stream_sock layer accept() called for either TCP or UNIX
sockets, and this function calls the frontend-specific accept() function
which does the rest of the frontend-specific initialisation.
Some code is still duplicated (session & task allocation, stream interface
initialization), and might benefit from having an intermediate session-level
accept() callback to perform such initializations. Still there are some
minor differences that need to be addressed first. For instance, the monitor
nets should only be checked for proxies and not for other connection templates.
Last, we renamed l->private as l->frontend. The "private" pointer in
the listener is only used to store a frontend, so let's rename it to
eliminate this ambiguity. When we later support detached listeners
(eg: FTP), we'll add another field to avoid the confusion.
The 'client.c' file now only contained frontend-specific functions,
so it has naturally be renamed 'frontend.c'. Same for client.h. This
has also been an opportunity to remove some cross references from
files that should not have depended on it.
In the end, this file should contain a protocol-agnostic accept()
code, which would initialize a session, task, etc... based on an
accept() from a lower layer. Right now there are still references
to TCP.
Some functions which act on generic buffer contents without being
tcp-specific were historically in proto_tcp.c. This concerns ACLs
and RDP cookies. Those have been moved away to more appropriate
locations. Ideally we should create some new files for each layer6
protocol parser. Let's do that later.
Using get_ip_from_hdr2() we can look for occurrence #X or #-X and
extract the IP it contains. This is typically designed for use with
the X-Forwarded-For header.
Using "usesrc hdr_ip(name,occ)", it becomes possible to use the IP address
found in <name>, and possibly specify occurrence number <occ>, as the
source to connect to a server. This is possible both in a server and in
a backend's source statement. This is typically used to use the source
IP previously set by a upstream proxy.
The transparent proxy address selection was set in the TCP connect function
which is not the most appropriate place since this function has limited
access to the amount of parameters which could produce a source address.
Instead, now we determine the source address in backend.c:connect_server(),
right after calling assign_server_address() and we assign this address in
the session and pass it to the TCP connect function. This cannot be performed
in assign_server_address() itself because in some cases (transparent mode,
dispatch mode or http_proxy mode), we assign the address somewhere else.
This change will open the ability to bind to addresses extracted from many
other criteria (eg: from a header).
The following patch fixed an issue but brought another one :
296897 [MEDIUM] connect to servers even when the input has already been closed
The new issue is that when a connection is inspected and aborted using
TCP inspect rules, now it is sent to the server before being closed. So
that test is not satisfying. A probably better way is not to prevent a
connection from establishing if only BF_SHUTW_NOW is set but BF_SHUTW
is not. That way, the BF_SHUTW flag is not set if the request has any
data pending, which still fixes the stats issue, but does not let any
empty connection pass through.
Also, as a safety measure, we extend buffer_abort() to automatically
disable the BF_AUTO_CONNECT flag. While it appears to always be OK,
it is by pure luck, so better safe than sorry.
The trash buffer may now be smaller than a buffer because we can tune
it at run time. This causes a risk when we're trying to use it as a
temporary buffer to realign unaligned requests, because we may have to
put up to a full buffer into it.
Instead of doing a double copy, we're now relying on an open-coded
bouncing copy algorithm. The principle is that we move one byte at
a time to its final place, and if that place also holds a byte, then
we move it too, and so on. We finish when we've moved all the buffer.
It limits the number of memory accesses, but since it proceeds one
byte at a time and with random walk, it's not cache friendly and
should be slower than a double copy. However, it's only used in
extreme situations and the difference will not be noticeable.
It has been extensively tested and works reliably.
This is a first attempt to add a maintenance mode on servers, using
the stat socket (in admin level).
It can be done with the following command :
- disable server <backend>/<server>
- enable server <backend>/<server>
In this mode, no more checks will be performed on the server and it
will be marked as a special DOWN state (MAINT).
If some servers were tracking it, they'll go DOWN until the server
leaves the maintenance mode. The stats page and the CSV export also
display this special state.
This can be used to disable the server in haproxy before doing some
operations on this server itself. This is a good complement to the
"http-check disable-on-404" keyword and works in TCP mode.
Support the new syntax (http-request allow/deny/auth) in
http stats.
Now it is possible to use the same syntax is the same like in
the frontend/backend http-request access control:
acl src_nagios src 192.168.66.66
acl stats_auth_ok http_auth(L1)
stats http-request allow if src_nagios
stats http-request allow if stats_auth_ok
stats http-request auth realm LB
The old syntax is still supported, but now it is emulated
via private acls and an aditional userlist.
Add generic authentication & authorization support.
Groups are implemented as bitmaps so the count is limited to
sizeof(int)*8 == 32.
Encrypted passwords are supported with libcrypt and crypt(3), so it is
possible to use any method supported by your system. For example modern
Linux/glibc instalations support MD5/SHA-256/SHA-512 and of course classic,
DES-based encryption.
Just as for the req* rules, we can now condition rsp* rules with ACLs.
ACLs match on response, so volatile request information cannot be used.
A warning is emitted if a configuration contains such an anomaly.
From now on, if request filters have ACLs defined, these ACLs will be
evaluated to condition the filter. This will be used to conditionally
remove/rewrite headers based on ACLs.
This function automatically builds a rule, considering the if/unless
statements, and automatically updates the proxy's acl_requires, the
condition's file and line.
Calling this function after http_find_header2() automatically deletes
the current value of the header, and removes the header itself if the
value is the only one. The context is automatically adjusted for a
next call to http_find_header2() to return the next header. No other
change nor test should be made on the transient context though.
The stream_int_cond_close() function was added to preserve the
contents of the response buffer because stream_int_retnclose()
was buggy. It flushed the response instead of flushing the
request. This caused issues with pipelined redirects followed
by error messages which ate the previous response.
This might even have caused object truncation on pipelined
requests followed by an error or by a server redirection.
Now that this is fixed, simply get rid of the now useless
function.
Several HTTP analysers used to set those flags to values that
were useful but without considering the possibility that they
were not called again to clean what they did. First, replace
direct flag manipulation with more explicit macros. Second,
enforce a rule stating that any buffer which changes one of
these flags from the default must restore it after completion,
so that other analysers see correct flags.
With both this fix and the previous one about analyser bits,
we should not see any more stuck sessions.
Supported informations, available via "tr/td title":
- cap: capabilities (proxy)
- mode: one of tcp, http or health (proxy)
- id: SNMP ID (proxy, socket, server)
- IP (socket, server)
- cookie (backend, server)
The body parser will be used in close and keep-alive modes. It follows
the stream to keep in sync with both the request and the response message.
Both chunked transfer-coding and content-length are supported according to
RFC2616.
The multipart/byterange encoding has not yet been implemented and if not
seconded by any of the two other ones, will be forwarded till the close,
as requested by the specification.
Both the request and the response analysers converge into an HTTP_MSG_DONE
state where it will be possible to force a close (option forceclose) or to
restart with a fresh new transaction and maintain keep-alive.
This change is important. All tests are OK but any possible behaviour
change with "option httpclose" might find its root here.
This code really belongs to the http part since it's transaction-specific.
This will also make it easier to later reinitialize a transaction in order
to support keepalive.
We used to apply a limit to each buffer's size in order to leave
some room to rewrite headers, then we used to remove this limit
once the session switched to a data state.
Proceeding that way becomes a problem with keepalive because we
have to know when to stop reading too much data into the buffer
so that we can leave some room again to process next requests.
The principle we adopt here consists in only relying on to_forward+send_max.
Indeed, both of those data define how many bytes will leave the buffer.
So as long as their sum is larger than maxrewrite, we can safely
fill the buffers. If they are smaller, then we refrain from filling
the buffer. This means that we won't risk to fill buffers when
reading last data chunk followed by a POST request and its contents.
The only impact identified so far is that we must ensure that the
BF_FULL flag is correctly dropped when starting to forward. Right
now this is OK because nobody inflates to_forward without using
buffer_forward().
Implement decreasing health based on observing communication between
HAProxy and servers.
Changes in this version 2:
- documentation
- close race between a started check and health analysis event
- don't force fastinter if it is not set
- better names for options
- layer4 support
Changes in this version 3:
- add stats
- port to the current 1.4 tree
To sum up :
- len : it's now the max number of characters for the value, preventing
garbaged results.
- a new option "prefix" is added, this allows to use dynamic cookie
names (e.g. ASPSESSIONIDXXX).
Previously in the thread, I wanted to use the value found with
"capture cookie" but when i started to update the documentation, I
found this solution quite weird. I've made a small rework to not
depend on "capture cookie".
- There's the posssiblity to define the URL parser mode (path parameters
or query string).
We now set msg->col and msg->sov to the first byte of non-header.
They will be used later when parsing chunks. A new macro was added
to perform size additions on an http_msg in order to limit the risks
of copy-paste in the long term.
During this operation, it appeared that the http_msg struct was not
optimal on 64-bit, so it was re-ordered to fill the holes.
Right now, an HTTP server cannot track a TCP server and vice-versa.
This patch enables proxy tracking without relying on the proxy's mode
(tcp/http/health). It only requires a matching proxy name to exist. The
original function was renamed to findproxy_mode().
All files referencing the previous ebtree code were changed to point
to the new one in the ebtree directory. A makefile variable (EBTREE_DIR)
is also available to use files from another directory.
The ability to build the libebtree library temporarily remains disabled
because it can have an impact on some existing toolchains and does not
appear worth it in the medium term if we add support for multi-criteria
stickiness for instance.
The code part which waits for an HTTP response has been extracted
from the old function. We now have two analysers and the second one
may re-enable the first one when an 1xx response is encountered.
This has been tested and works.
The calls to stream_int_return() that were remaining in the wait
analyser have been converted to stream_int_retnclose().
This patch has 2 goals :
1. I wanted to test the appsession feature with a small PHP code,
using PHPSESSID. The problem is that when PHP gets an unknown session
id, it creates a new one with this ID. So, when sending an unknown
session to PHP, persistance is broken : haproxy won't see any new
cookie in the response and will never attach this session to a
specific server.
This also happens when you restart haproxy : the internal hash becomes
empty and all sessions loose their persistance (load balancing the
requests on all backend servers, creating a new session on each one).
For a user, it's like the service is unusable.
The patch modifies the code to make haproxy also learn the persistance
from the client : if no session is sent from the server, then the
session id found in the client part (using the URI or the client cookie)
is used to associated the server that gave the response.
As it's probably not a feature usable in all cases, I added an option
to enable it (by default it's disabled). The syntax of appsession becomes :
appsession <cookie> len <length> timeout <holdtime> [request-learn]
This helps haproxy repair the persistance (with the risk of losing its
session at the next request, as the user will probably not be load
balanced to the same server the first time).
2. This patch also tries to reduce the memory usage.
Here is a little example to explain the current behaviour :
- Take a Tomcat server where /session.jsp is valid.
- Send a request using a cookie with an unknown value AND a path
parameter with another unknown value :
curl -b "JSESSIONID=12345678901234567890123456789012" http://<haproxy>/session.jsp;jsessionid=00000000000000000000000000000001
(I know, it's unexpected to have a request like that on a live service)
Here, haproxy finds the URI session ID and stores it in its internal
hash (with no server associated). But it also finds the cookie session
ID and stores it again.
- As a result, session.jsp sends a new session ID also stored in the
internal hash, with a server associated.
=> For 1 request, haproxy has stored 3 entries, with only 1 which will be usable
The patch modifies the behaviour to store only 1 entry (maximum).
This alone makes a typical HTML stats dump consume 10% CPU less,
because we avoid doing complex printf calls to drop them later.
Only a few common cases have been checked, those which are very
likely to run for nothing.
It is a bit expensive and complex to use to call buffer_feed()
directly from the request parser, and there are risks that some
output messages are lost in case of buffer full. Since most of
these messages are static, let's have a state dedicated to print
these messages and store them in a specific area shared with the
stats in the session. This both reduces code size and risks of
losing output data.
Add two functions to encode input chunk replacing
non-printable, non ascii or special characters
with:
"&#%u;" - chunk_htmlencode
"<%02X>" - chunk_asciiencode
Above functions should be used when adding strings, received
from possible unsafe sources, to html stats or logs.
int get_backend_server(const char *bk_name, const char *sv_name,
struct proxy **bk, struct server **sv);
This function scans the list of backends and servers to retrieve the first
backend and the first server with the given names, and sets them in both
parameters. It returns zero if either is not found, or non-zero and sets
the ones it did not found to NULL. If a NULL pointer is passed for the
backend, only the pointer to the server will be updated.