These offsets were relative to the buffer itself. Now they're relative to
the buffer's origin (buf->p) which normally corresponds to the start of
current message.
This saves a big dependency between the HTTP message struct and the buffers.
It appeared during this change that ->col is not used anymore (it will have
to be removed). Next step is to turn ->eol and ->sol from absolute to relative.
The buffer's pointer <lr> was only used by HTTP parsers which also use a
struct http_msg to keep track of the parser's state. We've reached a point
where it makes no sense to keep ->lr in the buffer, as the split between
buffer and msg is only arbitrary for historical reasons.
This change ensures that touching buffers will not impact HTTP messages
anymore, making the buffers more content-agnostic. However, it becomes
very important not to forget to update msg->next when some data get
forwarded or moved (and in general each time buf->p is updated).
The new pointer in http_msg becomes relative to buffer->p so that
parsing multiple messages becomes easier. It is possible that at one
point ->som and ->next will be merged.
Note: http_parse_reqline() and http_parse_stsline() have been temporarily
modified to know the message starting point in the buffer (->p).
This change gets rid of buf->r which is always equal to buf->p + buf->i.
It removed some wrapping detection at a number of places, but required addition
of new relative offset computations at other locations. A large number of places
can be simplified now with extreme care, since most of the time, either the
pointer has to be computed once or we need a difference between the old ->w and
old ->r to compute free space. The cleanup will probably happen with the rewrite
of the buffer_input_* and buffer_output_* functions anyway.
buf->lr still has to move to the struct http_msg and be relative to buf->p
for the rework to be complete.
This change introduces the buffer's base pointer, which is the limit between
incoming and outgoing data. It's the point where the parsing should start
from. A number of computations have already been greatly simplified, but
more simplifications are expected to come from the removal of buf->r.
The changes appear good and have revealed occasional improper use of some
pointers. It is possible that this patch has introduced bugs or revealed
some, although preliminary testings tend to indicate that everything still
works as it should.
We don't have buf->l anymore. We have buf->i for pending data and
the total length is retrieved by adding buf->o. Some computation
already become simpler.
Despite extreme care, bugs are not excluded.
It's worth noting that msg->err_pos as set by HTTP request/response
analysers becomes relative to pending data and not to the beginning
of the buffer. This has not been completed yet so differences might
occur when outgoing data are left in the buffer.
Too many flags are stored in the transaction structure. Some flags are
clearly message-specific and exist in two versions (request and response).
Move them to a new "flags" field in the http_message struct instead.
memprintf() is just like snprintf() except that it always returns a properly
sized allocated string that the caller is responsible for freeing. NULL is
returned on serious errors. It also supports stackable calls over the same
pointer since it offers support for automatically freeing a previous one :
memprintf(&err, "invalid argument: '%s'", arg);
...
memprintf(&err, "keyword parser said: <%s>", *err);
...
memprintf(&err, "line parser said: %s\n", *err);
...
free(*err);
It's very annoying that we have to deal with the crappy size_t and with ints
at some places because these ones don't mix well. Patch 6f61b2 changed the
chunk len to int but its size remains size_t and some functions are having
trouble being used by several callers depending on the type of their arguments.
Let's turn extract_cookie_value() to int for now on, and plan a massive cleanup
later to remove all size_t.
These callbacks are used to retrieve the source and destination address
of a socket. The address flags are not hold on the stream interface and
not on the session anymore. The addresses are collected when needed.
This still needs to be improved to store the IP and port separately so
that it is not needed to perform a getsockname() when only the IP address
is desired for outgoing traffic.
The Unique ID, is an ID generated with several informations. You can use
a log-format string to customize it, with the "unique-id-format" keyword,
and insert it in the request header, with the "unique-id-header" keyword.
%Fi: Frontend IP
%Fp: Frontend Port
%Si: Server IP
%Sp: Server Port
%Ts: Timestamp
%rt: HTTP request counter
%H: hostname
%pid: PID
+X: Hexadecimal represenation
The +X mode in logformat displays hexadecimal for the following flags
%Ci %Cp %Fi %Fp %Bi %Bp %Si %Sp %Ts %ct %pid
rename logformat_write_string() to lf_text()
Optimize size computation
* logformat functions now take a format linked list as argument
* build_logline() build a logline using a format linked list
* rename LOG_* by LOG_FMT_* in enum
* improve error management in build_logline()
Sometimes it is desirable to forward a particular request to a specific
server without having to declare a dedicated backend for this server. This
can be achieved using the "use-server" rules. These rules are evaluated after
the "redirect" rules and before evaluating cookies, and they have precedence
on them. There may be as many "use-server" rules as desired. All of these
rules are evaluated in their declaration order, and the first one which
matches will assign the server.
memcmp()/strcmp() calls were needed in different parts of code to determine
the status code. Each new status code introduces new calls, which can become
inefficient and source of bugs.
This patch reorganizes the code to rely on a numeric status code internally
and to be hopefully more generic.
Previously, the stats admin page required POST parameters to be provided
exactly in the same order as the HTML form.
This patch allows to handle those parameters in any orders.
Also, note that haproxy won't alter server states anymore if backend or server
names are ambiguous (duplicated names in the configuration) to prevent
unexpected results (the same should probably be applied to the stats socket).
Commit a1cc38 introduced a regression which was easy to trigger till ad4cd58
(snapshots 20120222 to 20120311 included). The bug was still present after
that but harder to trigger.
The bug is caused by the use of two distinct log buffers due to intermediary
changes. The issue happens when an HTTP request is logged just after a TCP
request during the same second and the HTTP request is too large for the buffer.
In this case, it happens that the HTTP request is logged into the TCP buffer
instead and that length controls can't detect anything.
Starting with bddd4f, the issue is still possible when logging too large an
HTTP request just after a send_log() call (typically a server status change).
We owe a big thanks to Sander Klein for testing several snapshots and more
specifically for taking significant risks in production by letting the buggy
version crash several times in order to provide an exploitable core ! The bug
could not have been found without this precious help. Thank you Sander !
This fix does not need to be backported, it did not affect any released version.
The difference could be seen when logging a request in HTTP mode with option
tcplog, as it would keep emitting 4 chars. Better use two distinct flags to
clear the confusion.
%Bi return the backend source IP
%Bp return the backend source port
Add a function pointer in logformat_type to do additional configuration
during the log-format variable parsing.
Merge http_sess_log() and tcp_sess_log() to sess_log() and move it to
log.c
A new field in logformat_type define if you can use a logformat
variable in TCP or HTTP mode.
doc: log-format in tcp mode
Note that due to the way log buffer allocation currently works, trying to
log an HTTP request without "option httplog" is still not possible. This
will change in the near future.
A number of offset computation functions use struct buffer* arguments
and return integers without modifying the input. Using consts helps
simplifying some operations in callers.
The principle behind this load balancing algorithm was first imagined
and modeled by Steen Larsen then iteratively refined through several
work sessions until it would totally address its original goal.
The purpose of this algorithm is to always use the smallest number of
servers so that extra servers can be powered off during non-intensive
hours. Additional tools may be used to do that work, possibly by
locally monitoring the servers' activity.
The first server with available connection slots receives the connection.
The servers are choosen from the lowest numeric identifier to the highest
(see server parameter "id"), which defaults to the server's position in
the farm. Once a server reaches its maxconn value, the next server is used.
It does not make sense to use this algorithm without setting maxconn. Note
that it can however make sense to use minconn so that servers are not used
at full load before starting new servers, and so that introduction of new
servers requires a progressively increasing load (the number of servers
would more or less follow the square root of the load until maxconn is
reached). This algorithm ignores the server weight, and is more beneficial
to long sessions such as RDP or IMAP than HTTP, though it can be useful
there too.
http_sess_log now use the logformat linked list to make the log
string, snprintf is not used for speed issue.
CLF mode also uses logformat.
NOTE: as of now, empty fields in CLF now are "" not "-" anymore.
parse_logformat_string: parse the string, detect the type: text,
separator or variable
parse_logformat_var: dectect variable name
parse_logformat_var_args: parse arguments and flags
add_to_logformat_list: add to the logformat linked list
send_log function is now splited in 3 functions
* hdr_log: generate the syslog header
* send_log: send a syslog message with a printf format string
* __send_log: send a syslog message
It was reported that a server configured with a zero weight would
sometimes still take connections from the backend queue. This issue is
real, it happens this way :
1) the disabled server accepts a request with a cookie
2) many cookie-less requests accumulate in the backend queue
3) when the disabled server completes its request, it checks its own
queue and the backend's queue
4) the server takes a pending request from the backend queue and
processes it. In response, the server's cookie is assigned to
the client, which ensures that some requests will continue to
be served by this server, leading back to point 1 above.
The fix consists in preventing a zero-weight server from dequeuing pending
requests from the backend. Making use of srv_is_usable() in such tests makes
the tests more robust against future changes.
This fix must be backported to 1.4 and 1.3.
In a config where server "s1" is marked disabled and "s2" tracks "s1",
s2 appears disabled on the stats but is still inserted into the LB farm
because the tracking is resolved too late in the configuration process.
We now resolve tracked servers before building LB maps and we also mark
the tracking server in maintenance mode, which previously was not done,
causing half of the issue.
Last point is that we also protect srv_is_usable() against electing a
server marked for maintenance. This is not absolutely needed but is a
safe choice and makes a lot of sense.
This fix must be backported to 1.4.
New option "http-send-name-header" specifies the name of a header which
will hold the server name in outgoing requests. This is the name of the
server the connection is really sent to, which means that upon redispatches,
the header's value is updated so that it always matches the server's name.
The new function does not return IP addresses but header values instead,
so that the caller is free to make what it want of them. The conversion
is not quite clean yet, as the previous test which considered that address
0.0.0.0 meant "no address" is still used. A different IP parsing function
should be used to take this into account.
Till now the pattern data integer type was unsigned without any
particular reason. In order to make ACLs use it, we must switch it
to signed int instead.
(from ebtree 6.0.6)
This version is mainly aimed at clarifying the fact that the ebtree license
is LGPL. Some files used to indicate LGPL and other ones GPL, while the goal
clearly is to have it LGPL. A LICENSE file has also been added.
No code is affected, but it's better to have the local tree in sync anyway.
(cherry picked from commit 24dc7cca051f081600fe8232f33e55ed30e88425)
In commit 4b517ca93a (MEDIUM: buffers:
add some new primitives and rework existing ones), we forgot to check
if buffer_max_len() < l.
No backport is needed.
A number of primitives were missing for buffer management, and some
of them were particularly awkward to use. Specifically, the functions
used to compute free space could not always be used depending what was
wrapping in the buffers. Some documentation has been added about how
the buffers work and their properties. Some functions are still missing
such as a buffer replacement which would support wrapping buffers.
This patch settles the 2 loggers limitation.
Loggers are now stored in linked lists.
Using "global log", the global loggers list content is added at the end
of the current proxy list. Each "log" entries are added at the end of
the proxy list.
"no log" flush a logger list.
Ludovic Levesque reported and diagnosed an annoying bug. When a server is
configured to track another one and has a slowstart interval set, it's
assigned a minimal weight when the tracked server goes back up but keeps
this weight forever.
This is because the throttling during the warmup phase is only computed
in the health checking function.
After several attempts to resolve the issue, the only real solution is to
split the check processing task in two tasks, one for the checks and one
for the warmup. Each server with a slowstart setting has a warmum task
which is responsible for updating the server's weight after a down to up
transition. The task does not run in othe situations.
In the end, the fix is neither complex nor long and should be backported
to 1.4 since the issue was detected there first.
When reading the code, the "tracked" member of a server makes one
think the server is tracked while it's the opposite, it's a pointer
to the server being tracked. This is particularly true in constructs
such as :
if (srv->tracked) {
Since it's the second time I get caught misunderstanding it, let's
rename it to "track" to avoid the confusion.
For a long time, the max number of headers was taken as a part of the buffer
size. Since the header size can be configured at runtime, it does not make
much sense anymore.
Nothing was making it necessary to have a static value, so let's turn this into
a tunable with a default value of 101 which equals what was previously used.
It makes no sense to have one pointer to the hdr_idx pool in each proxy
struct since these pools do not depend on the proxy. Let's have a common
pool instead as it is already the case for other types.
By default, pipes are the default size for the system. But sometimes when
using TCP splicing, it can improve performance to increase pipe sizes,
especially if it is suspected that pipes are not filled and that many
calls to splice() are performed. This has an impact on the kernel's
memory footprint, so this must not be changed if impacts are not understood.
Struct sockaddr_storage is huge (128 bytes) and severely impacts the
cache. It also displaces other struct members, causing them to have
larger relative offsets. By moving these few occurrences to the end
of the structs which host them, we can reduce the code size by no less
than 2 kB !
Stream interfaces used to distinguish between client and server addresses
because they were previously of different types (sockaddr_storage for the
client, sockaddr_in for the server). This is not the case anymore, and this
distinction is confusing at best and has caused a number of regressions to
be introduced in the process of converting everything to full-ipv6. We can
now remove this and have a much cleaner code.
This patch introduces hdr_len, path_len and url_len for matching these
respective parts lengths against integers. This can be used to detect
abuse or empty headers.
These requests are mainly monitor requests, as well as stats requests when
the stats are processed by the frontend. Having this counter helps explain
the difference in number of sessions that is sometimes observed between a
frontend and a backend.
We now measure the work and idle times in order to report the idle
time in the stats. It's expected that we'll be able to use it at
other places later.
We already had the ability to kill a connection, but it was only
for the checks. Now we can do this for any session, and for this we
add a specific flag "K" to the logs.
The stats socket now allows the admin to disable, enable or shutdown a frontend.
This can be used when a bug is discovered in a configuration and it's desirable
to fix it but the rules in place don't allow to change a running config. Thus it
becomes possible to kill the frontend to release the port and start a new one in
a separate process.
This can also be used to temporarily make haproxy return TCP resets to incoming
requests to pretend the service is not bound. For instance, this may be useful
to quickly flush a very deep SYN backlog.
The frontend check and lookup code was factored with the "set maxconn" usage.
This one enforces a per-process connection rate limit, regardless of what
may be set per frontend. It can be a way to limit the CPU usage of a process
being severely attacked.
The side effect is that the global process connection rate is now measured
for each incoming connection, so it will be possible to report it.
This option permits to change the global maxconn setting within the
limit that was set by the initial value, which is now reported as the
hard maxconn value. This allows to immediately accept more concurrent
connections or to stop accepting new ones until the value passes below
the indicated setting.
The main use of this option is on systems where many haproxy instances
are loaded and admins need to re-adjust resource sharing at run time
to regain a bit of fairness between processes.
Trailing spaces after headers were not trimmed, only the leading ones
were. An issue was detected today with a content-length value which
was padded with spaces and which was rejected. Recent updates to the
http-bis draft made it a lot more clear that such spaces must be ignored,
so this is what this patch does.
It should be backported to 1.4.
Many inet_ntop calls were partially right, which was hard to detect given
the complex combinations. Some of them were relying on the listener's proto
instead of the address itself, which could have been different when dealing
with an accept-proxy connection.
The new addr_to_str() function does the dirty job and returns the family, which
makes it particularly suited to calls from switch/case statements. A large number
of if/else statements were removed and the stats output could even be cleaned up
in the case of session dump.
As a side effect of doing this, the resulting code is smaller by almost 1kB.
All changed parts have been tested and provided expected output.
Some older libc don't define splice() and and don't define _syscall*()
either, which causes build errors if splicing is enabled.
To solve this, we now split the syscall redefinition into two layers :
- one file per syscall (epoll, splice)
- one common file to declare the _syscall*() macros
The code is cleaner because files using the syscalls just have to include
their respective file. It's not adviced to merge multiple syscall families
into a same file if all are not intended to be used simultaneously, because
defining unused static functions causes warnings to be emitted during build.
As a result, the new USE_MY_SPLICE parameter was added in order to be able
to define the splice() syscall separately.
If "option forwardfor" has the "if-none" argument, then the header is
only added when the request did not already have one. This option has
security implications, and should not be set blindly.
Manoj Kumar reported a case where haproxy would crash upon start-up. The
cause was an "http-check expect" statement declared in the defaults section,
which caused a NULL regex to be used during the check. This statement is not
allowed in defaults sections precisely because this requires saving a copy
of the regex in the default proxy. But the check was not made to prevent it
from being declared there, hence the issue.
Instead of adding code to detect its abnormal use, we decided to implement
it. It was not that much complex because the expect_str part was not used
with regexes, so it could hold the string form of the regex in order to
compile it again for every backend (there's no way to clone regexes).
This patch has been tested and works. So it's both a bugfix and a minor
feature enhancement.
It should be backported to 1.4 though it's not critical since the config
was not supposed to be supported.
Adding health checks has become a real pain, with cross-references to all
checks everywhere because they're all a single bit. Since they're all
exclusive, let's change this to have a check number only. We reserve 4
bits allowing up to 16 checks (15+tcp), only 7 of which are currently
used. The code has shrunk by almost 1kB and we saved a few option bits.
The "dispatch" option has been moved to px->options, making a few tests
a bit cleaner.
This patch provides a new "option redis-check" statement to enable server health checks based on redis PING request (http://www.redis.io/commands/ping).
This global task is used to periodically check for end of resource shortage
and to try to enable queued listeners again. This is important in case some
temporary system-wide shortage is encountered, so that we don't have to wait
for an existing connection to be released before checking the queue again.
For situations where listeners are queued due to the global maxconn being
reached, the task is woken up at least every second. For situations where
a system resource shortage is detected (memory, sockets, ...) the task is
woken up at least every 100 ms. That way, recovery from severe events can
still be achieved under acceptable conditions.
This was revealed with one of the very latest patches which caused
the listener_queue not to be initialized on the stats socket frontend.
And in fact a number of other ones were missing too. This is getting so
boring that now we'll always make use of the same function to initialize
any proxy. Doing so has even saved about 500 bytes on the binary due to
the avoided code redundancy.
No backport is needed.
This function is finally not needed anymore, as it has been replaced with
a per-proxy task that is scheduled when some limits are encountered on
incoming connections or when the process is stopping. The savings should
be noticeable on configs with a large number of proxies. The most important
point is that the rate limiting is now enforced in a clean and solid way.
Those states have been replaced with PR_STFULL and PR_STREADY respectively,
as it is what matches them the best now. Also, two occurrences of PR_STIDLE
in peers.c have been removed as this did not provide any form of error recovery
anyway.
All listeners that are limited by a proxy-specific resource are now
queued at the proxy's and not globally. This allows finer-grained
wakeups when releasing resource.
When an accept() fails because of a connection limit or a memory shortage,
we now disable it and queue it so that it's dequeued only when a connection
is released. This has improved the behaviour of the process near the fd limit
as now a listener with a no connection (eg: stats) will not loop forever
trying to get its connection accepted.
The solution is still not 100% perfect, as we'd like to have this used when
proxy limits are reached (use a per-proxy list) and for safety, we'd need
to have dedicated tasks to periodically re-enable them (eg: to overcome
temporary system-wide resource limitations when no connection is released).
When a listeners encounters a resource shortage, it currently stops until
one re-enables it. This is far from being perfect as it does not yet handle
the case where the single connection from the listener is rejected (eg: the
stats page).
Now we'll have a special status for resource limited listeners and we'll
queue them into one or multiple lists. That way, each time we have to stop
a listener because of a resource shortage, we can enqueue it and change its
state, so that it is dequeued once more resources are available.
This patch currently does not change any existing behaviour, it only adds
the basic building blocks for doing that.
Managing listeners state is difficult because they have their own state
and can at the same time have theirs dictated by their proxy. The pause
is not done properly, as the proxy code is fiddling with sockets. By
introducing new functions such as pause_listener()/resume_listener(), we
make it a bit more obvious how/when they're supposed to be used. The
listen_proxies() function was also renamed to resume_proxies() since
it's only used for pause/resume.
This patch is the first in a series aiming at getting rid of the maintain_proxies
mess. In the end, proxies should not call enable_listener()/disable_listener()
anymore.
Patch af5149 introduced an issue which can be detected only on out of
memory conditions : a LIST_DEL() may be performed on an uninitialized
struct member instead of a LIST_INIT() during the accept() phase,
causing crashes and memory corruption to occur.
This issue was detected and diagnosed by the Exceliance R&D team.
This is 1.5-specific and very recent, so no existing deployment should
be impacted.
Never add connections allocated to this sever to a stick-table.
This may be used in conjunction with backup to ensure that
stick-table persistence is disabled for backup servers.
apsession_refresh() and apsess_refressh are only used inside apsession.c
and thus can be made static.
The only use of apsession_refresh() is appsession_task_init().
These functions have been re-ordered to avoid the need for
a forward-declaration of apsession_refresh().
If a connection is closed by because the backend became unavailable
then log 'D' as the termination condition.
Signed-off-by: Simon Horman <horms@verge.net.au>
This adds the "on-marked-down shutdown-sessions" statement on "server" lines,
which causes all sessions established on a server to be killed at once when
the server goes down. The task's priority is reniced to the highest value
(1024) so that servers holding many tasks don't cause a massive slowdown due
to the wakeup storm.
The motivation for this is to allow iteration of all the connections
of a server without the expense of iterating over the global list
of connections.
The first use of this will be to implement an option to close connections
associated with a server when is is marked as being down or in maintenance
mode.
* The declaration of peer_session_create() does
not match its definition. As it is only
used inside of peers.c make it static.
* Make the declaration of peers_register_table()
match its definition.
* Also, make all functions in peers.c that
are not also in peers.h static
Bashkim Kasa reported that the stats admin page did not work when colons
were used in server or backend names. This was caused by url-encoding
resulting in ':' being sent as '%3A'. Now we systematically decode the
field names and values to fix this issue.
It's more expensive to call splice() on short payloads than to use
recv()+send(). One of the reasons is that doing a splice() involves
allocating a pipe. One other reason is that the kernel will have to
copy itself if we try to splice less than a page. So let's fix a
short offset of 4kB below which we don't splice.
A quick test shows that on chunked encoded data, with splice we had
6826 syscalls (1715 splice, 3461 recv, 1650 send) while with this
patch, the same transfer resulted in 5793 syscalls (3896 recv, 1897
send).
There are some very rare server-to-server applications that abuse the HTTP
protocol and expect the payload phase to be highly interactive, with many
interleaved data chunks in both directions within a single request. This is
absolutely not supported by the HTTP specification and will not work across
most proxies or servers. When such applications attempt to do this through
haproxy, it works but they will experience high delays due to the network
optimizations which favor performance by instructing the system to wait for
enough data to be available in order to only send full packets. Typical
delays are around 200 ms per round trip. Note that this only happens with
abnormal uses. Normal uses such as CONNECT requests nor WebSockets are not
affected.
When "option http-no-delay" is present in either the frontend or the backend
used by a connection, all such optimizations will be disabled in order to
make the exchanges as fast as possible. Of course this offers no guarantee on
the functionality, as it may break at any other place. But if it works via
HAProxy, it will work as fast as possible. This option should never be used
by default, and should never be used at all unless such a buggy application
is discovered. The impact of using this option is an increase of bandwidth
usage and CPU usage, which may significantly lower performance in high
latency environments.
This change should be backported to 1.4 since the first report of such a
misuse was in 1.4. Next patch will also be needed.
This status code is used in response to requests matching "monitor-uri".
Some users need to adjust it to fit their needs (eg: make some strings
appear there). As it's already defined as a chunked string and used
exactly like other status codes, it makes sense to make it configurable
with the usual "errorfile", "errorloc", ...
John Helliwell reported a runtime issue on Solaris since 1.5-dev5. Traces
show that connect() returns EINVAL, which means the socket length is not
appropriate for the family. Solaris does not like being called with sizeof
and needs the address family's size on sockaddr_storage.
The fix consists in adding a get_addr_len() function which returns the
socket's address length based on its family. Tests show that this works
for both IPv4 and IPv6 addresses.
Since IPv6 is a different type than IPv4, the pattern fetch functions
src6 and dst6 were added. IPv6 stick-tables can also fetch IPv4 addresses
with src and dst. In this case, the IPv4 addresses are mapped to their
IPv6 counterpart, according to RFC 4291.
Since the latest additions to buffer_forward(), it became too large for
inlining, so let's uninline it. The code size drops by 3kB. Should be
backported to 1.4 too.
Despite much care around handling the content-length as a 64-bit integer,
forwarding was broken on 32-bit platforms due to the 32-bit nature of
the ->to_forward member of the "buffer" struct. The issue is that this
member is declared as a long, so while it works OK on 64-bit platforms,
32-bit truncate the content-length to the lower 32-bits.
One solution could consist in turning to_forward to a long long, but it
is used a lot in the critical path, so it's not acceptable to perform
all buffer size computations on 64-bit there.
The fix consists in changing the to_forward member to a strict 32-bit
integer and ensure in buffer_forward() that only the amount of bytes
that can fit into it is considered. Callers of buffer_forward() are
responsible for checking that their data were taken into account. We
arbitrarily ensure we never consider more than 2G at once.
That's the way it was intended to work on 32-bit platforms except that
it did not.
This issue was tracked down hard at Exosec with Bertrand Jacquin,
Thierry Fournier and Julien Thomas. It remained undetected for a long
time because files larger than 4G are almost always transferred in
chunked-encoded format, and most platforms dealing with huge contents
these days run on 64-bit.
The bug affects all 1.5 and 1.4 versions, and must be backported.
The parser now distinguishes between pure addresses and address:port. This
is useful for some config items where only an address is required.
Raw IPv6 addresses are now parsed, but IPv6 host name resolution is still not
handled (gethostbyname does not resolve IPv6 names to addresses).
This option enables use of the PROXY protocol with the server, which
allows haproxy to transport original client's address across multiple
architecture layers.
Upon connection establishment, stream_sock is now able to send a PROXY
line before sending any data. Since it's possible that the buffer is
already full, and we don't want to allocate a block for that line, we
compute it on-the-fly when we need it. We just store the offset from
which to (re-)send from the end of the line, since it's assumed that
multiple outputs of the same proxy line will be strictly equivalent. In
practice, one call is enough. We just make sure to handle the case where
the first send() would indicate an incomplete output, eventhough it's
very unlikely to ever happen.
And also rename "req_acl_rule" "http_req_rule". At the beginning that
was a bit confusing to me, especially the "req_acl" list which in fact
holds what we call rules. After some digging, it appeared that some
part of the code is 100% HTTP and not just related to authentication
anymore, so let's move that part to HTTP and keep the auth-only code
in auth.c.
It's very annoying that frontend and backend stats are merged because we
don't know what we're observing. For instance, if a "listen" instance
makes use of a distinct backend, it's impossible to know what the bytes_out
means.
Some points take care of not updating counters twice if the backend points
to the frontend, indicating a "listen" instance. The thing becomes more
complex when we try to add support for server side keep-alive, because we
have to maintain a pointer to the backend used for last request, and to
update its stats. But we can't perform such comparisons anymore because
the counters will not match anymore.
So in order to get rid of this situation, let's have both frontend AND
backend stats in the "struct proxy". We simply update the relevant ones
during activity. Some of them are only accounted for in the backend,
while others are just for frontend. Maybe we can improve a bit on that
later, but the essential part is that those counters now reflect what
they really mean.
This patch turns internal server addresses to sockaddr_storage to
store IPv6 addresses, and makes the connect() function use it. This
code already works but some caveats with getaddrinfo/gethostbyname
still need to be sorted out while the changes had to be merged at
this stage of internal architecture changes. So for now the config
parser will not emit an IPv6 address yet so that user experience
remains unchanged.
This change should have absolutely zero user-visible effect, otherwise
it's a bug introduced during the merge, that should be reported ASAP.
This one has been removed and is now totally superseded by ->target.
To get the server, one must use target_srv(&s->target) instead of
s->srv now.
The function ensures that non-server targets still return NULL.
s->prev_srv is used by assign_server() only, but all code paths leading
to it now take s->prev_srv from the existing s->srv. So assign_server()
can do that copy into its own stack.
If at one point a different srv is needed, we still have a copy of the
last server on which we failed a connection attempt in s->target.
When dealing with HTTP keep-alive, we'll have to know if we can reuse
an existing connection. For that, we'll have to check if the current
connection was made on the exact same target (referenced in the stream
interface).
Thus, we need to first assign the next target to the session, then
copy it to the stream interface upon connect(). Later we'll check for
equivalence between those two operations.
Till now we used the fact that the dispatch address was not null to use
the dispatch mode. This is very unconvenient, so let's have a dedicated
option.
This is in fact where those parts belong to. The old data_state was replaced
by applet.state and is now initialized when the applet is registered. It's
worth noting that the applet does not need to know the session nor the
buffer anymore since everything is brought by the stream interface.
It is possible that having a separate applet struct would simplify the
code but that's not a big deal.
Now that we have the target pointer and type in the stream interface,
we don't need the applet.handler pointer anymore. That makes the code
somewhat cleaner because we know we're dealing with an applet by checking
its type instead of checking the pointer is not null.
When doing a connect() on a stream interface, some information is needed
from the server and from the backend. In some situations, we don't have
a server and only a backend (eg: peers). In other cases, we know we have
an applet and we don't want to connect to anything, but we'd still like
to have the info about the applet being used.
For this, we now store a pointer to the "target" into the stream interface.
The target describes what's on the other side before trying to connect. It
can be a server, a proxy or an applet for now. Later we'll probably have
descriptors for multiple-stage chains so that the final information may
still be found.
This will help removing many specific cases in the code. It already made
it possible to remove the "srv" and "be" parameters to tcpv4_connect_server().
Those 3 parts are the buffer side, the remote side and the communication
functions. This change has no functional effect but is needed to proceed
further.
I/O handlers are still delicate to manipulate. They have no type, they're
just raw functions which have no knowledge of themselves. Let's have them
declared as applets once for all. That way we can have multiple applets
share the same handler functions and we can store their names there. When
we later need to add more parameters (eg: usage stats), we'll be able to
do so in the applets themselves.
The CLI functions has been prefixed with "cli" instead of "stats" as it's
clearly what is going on there.
The applet descriptor in the stream interface should get all the applet
specific data (st0, ...) but this will be done in the next patch so that
we don't pollute this one too much.
Till now, the forwarding code was making use of the hdr_content_len member
to hold the size of the last chunk parsed. As such, it was reset after being
scheduled for forwarding. The issue is that this entry was reset before the
data could be viewed by backend.c in order to parse a POST body, so the
"balance url_param check_post" did not work anymore.
In order to fix this, we need two things :
- the chunk size (reset upon every forward)
- the total body size (not reset)
hdr_content_len was thus replaced by the former (hence the size of the patch)
as it makes more sense to have it stored that way than the way around.
This patch should be backported to 1.4 with care, considering that it affects
the forwarding code.
I have written a small patch to enable a correct PostgreSQL health check
It works similar to mysql-check with the very same parameters.
E.g.:
listen pgsql 127.0.0.1:5432
mode tcp
option pgsql-check user pgsql
server masterdb pgsql.server.com:5432 check inter 10000
One of the requirements we have is to run multiple instances of haproxy on a
single host; this is so that we can split the responsibilities (and change
permissions) between product teams. An issue we ran up against is how we
would distinguish between the logs generated by each instance. The solution
we came up with (please let me know if there is a better way) is to override
the application tag written to syslog. We can then configure syslog to write
these to different files.
I have attached a patch adding a global option 'log-tag' to override the
default syslog tag 'haproxy' (actually defaults to argv[0]).
Haproxy does not include the hostname rather the IP of the machine in
the syslog headers it sends. Unfortunately this means that for each log
line rsyslog does a reverse dns on the client IP and in the case of
non-routable IPs one gets the public hostname not the internal one.
While this is valid according to RFC3164 as one might imagine this is
troublsome if you have some machines with public IPs, internal IPs, no
reverse DNS entries, etc and you want a standardized hostname based log
directory structure. The rfc says the preferred value is the hostname.
This patch adds a global "log-send-hostname" statement which accepts an
optional string to force the host name. If unset, the local host name
is used.
HTTP pipelining currently needs to monitor the response buffer to wait
for some free space to be able to send a response. It was not possible
for the HTTP analyser to be called based on response buffer activity.
Now we introduce a new buffer flag BF_WAKE_ONCE which is set when the
HTTP request analyser is set on the response buffer and some activity
is detected. This is not clean at all but once of the only ways to fix
the issue before we make it possible to register events for analysers.
Also it appeared that one realign condition did not cover all cases.
This counter will help quickly spot whether there are new errors or not.
It is also assigned to each capture so that a script can keep trace of
which capture was taken when.
Debugging parsing errors can be greatly improved if we know what the parser
state was and what the buffer flags were (especially for closed inputs/outputs
and full buffers). Let's add that to the error snapshots.
When the number of servers is a multiple of the size of the input set,
map-based hash can be inefficient. This typically happens with 64
servers when doing URI hashing. The "avalanche" hash-type applies an
avalanche hash before performing a map lookup in order to smooth the
distribution. The result is slightly less smooth than the map for small
numbers of servers, but still better than the consistent hashing.
We'll use this hash at other places, let's make it globally available.
The function has also been renamed because its "chash_hash" name was
not appropriate.
Ross West reported that int32_t breaks compilation on FreeBSD. Since an
int is 32-bit on all supported platforms and we already rely on that,
change the type.
Enhance pattern convs and fetch argument parsing, now fetchs and convs callbacks used typed args.
Add more details on error messages on parsing pattern expression function.
Update existing pattern convs and fetchs to new proto.
Create stick table key type "binary".
Manage Truncation and padding if pattern's fetch-converted result don't match table key size.
MAXPATHLEN may be used at other places, it's unconvenient to have it
redefined in a few files. Also, since checking it requires including
sys/param.h, some versions of it cause a macro declaration conflict
with MIN/MAX which are defined in tools.h. The solution consists in
including sys/param.h in both files so that we ensure it's loaded
before the macros are defined and MAXPATHLEN is checked.
The introduction of a new PROXY protocol for proxied connections requires
an early analyser to decode the incoming connection and set the session
flags accordingly.
Some more work is needed, among which setting a flag on the session to
indicate it's proxied, and copying the original parameters for later
comparisons with new ACLs (eg: real_src, ...).
inetaddr_host_lim_ret() used to make use of const char** for some
args, but that make it impossible ot use char** due to the way
controls are made by gcc. So let's change that.
This option makes haproxy preserve any persistence cookie emitted by
the server, which allows the server to change it or to unset it, for
instance, after a logout request.
(cherry picked from commit 52e6d75374c7900c1fe691c5633b4ae029cae8d5)
The stats web interface must be read-only by default to prevent security
holes. As it is now allowed to enable/disable servers, a new keyword
"stats admin" is introduced to activate this admin level, conditioned by ACLs.
(cherry picked from commit 5334bab92ca7debe36df69983c19c21b6dc63f78)
Based on a patch provided by Judd Montgomery, it is now possible to
enable/disable servers from the stats web interface. This allows to select
several servers in a backend and apply the action to them at the same time.
Currently, there are 2 known limitations :
- The POST data are limited to one packet
(don't alter too many servers at a time).
- Expect: 100-continue is not supported.
(cherry picked from commit 7693948766cb5647ac03b48e782cfee2b1f14491)
If a cookie comes in with a first or last date, and they are configured on
the backend, they're checked. If a date is expired or too far in the future,
then the cookie is ignored and the specific reason appears in the cookie
field of the logs.
(cherry picked from commit faa3019107eabe6b3ab76ffec9754f2f31aa24c6)
These functions only require 5 chars to encode 30 bits, and don't expect
any padding. They will be used to encode dates in cookies.
(cherry picked from commit a7e2b5fc4612994c7b13bcb103a4a2c3ecd6438a)
The set-cookie status flags were not very handy and limited. Reorder
them to save some room for additional values and add the "U" flags
(for Updated expiration date) that will be used with expirable cookies
in insert mode.
(cherry picked from commit 5bab52f821bb0fa99fc48ad1b400769e66196ece)
We'll need one more bit to store and report the request cookie's status.
Doing this required moving a few bits around. However, now in 1.4 all bits
are used, there's no room left.
Cookie flags will need
(cherry picked from commit 09ebca0413c43620ddc375b5b4ab31a25d47b3f4)
In all cookie persistence modes but prefix, we now support cookies whose
value is suffixed with some contents after a vertical bar ('|'). This will
be used to pass an optional expiration date. So as of now we only consider
the part of the cookie value which is used before the vertical bar.
(cherry picked from commit a4486bf4e5b03b5a980d03fef799f6407b2c992d)
Add two new arguments to the "cookie" keyword, to be able to
fix a max idle and max life on them. Right now only the parameter
parsing is implemented.
(cherry picked from commit 9ad5dec4c3bb8f29129f292cb22d3fc495fcc98a)
HTTP content-based health checks will be involved in searching text in pages.
Some pages may not fit in the default buffer (16kB) and sometimes it might be
desired to have larger buffers in order to find patterns. Running checks on
smaller URIs is always preferred of course.
(cherry picked from commit 043f44aeb835f3d0b57626c4276581a73600b6b1)
This patch adds the "http-check expect [r]{string,status}" statements
which enable health checks based on whether the response status or body
to an HTTP request contains a string or matches a regex.
This probably is one of the oldest patches that remained unmerged. Over
the time, several people have contributed to it, among which FinalBSD
(first and second implementations), Nick Chalk (port to 1.4), Anze
Skerlavaj (tests and fixes), Cyril Bont (general fixes), and of course
myself for the final fixes and doc during integration.
Some people already use an old version of this patch which has several
issues, among which the inability to search for a plain string that is
not at the beginning of the data, and the inability to look for response
contents that are provided in a second and subsequent recv() calls. But
since some configs are already deployed, it was quite important to ensure
a 100% compatible behaviour on the working cases.
Thus, that patch fixes the issues while maintaining config compatibility
with already deployed versions.
(cherry picked from commit b507c43a3ce9a8e8e4b770e52e4edc20cba4c37f)
This patch provides a new "option ldap-check" statement to enable
server health checks based on LDAPv3 bind requests.
(cherry picked from commit b76b44c6fed8a7ba6f0f565dd72a9cb77aaeca7c)
There was no consistency between all the functions used to exchange data
between a buffer and a stream interface. Also, the functions used to send
data to a buffer did not consider the possibility that the buffer was
shutdown for read.
Now the functions are called buffer_{put,get}_{char,block,chunk,string}.
The old buffer_feed* functions have been left available for existing code
but marked deprecated.
This counter is incremented for each incoming connection and each active
listener, and is used to prevent haproxy from stopping upon SIGUSR1. It
will thus be possible for some tasks in increment this counter in order
to prevent haproxy from dying until they have completed their job.
Signal zero is never delivered by the system. However having a signal to
which functions and tasks can subscribe to be notified of a stopping event
is useful. So this patch does two things :
1) allow signal zero to be delivered from any function of signal handler
2) make soft_stop() deliver this signal so that tasks can be notified of
a stopping condition.
The two new functions below make it possible to register any number
of functions or tasks to a system signal. They will be called in the
registration order when the signal is received.
struct sig_handler *signal_register_fct(int sig, void (*fct)(struct sig_handler *), int arg);
struct sig_handler *signal_register_task(int sig, struct task *task, int reason);
In case of binding failure during startup, we wait for some time sending
signals to old pids so that they release the ports we need. But if there
aren't any old pids anymore, it's useless to wait, we prefer to fail fast.
Along with this change, we now have the number of old pids really found
in the nb_oldpids variable.
In case of HTTP keepalive processing, we want to release the counters tracked
by the backend. Till now only the second set of counters was released, while
it could have been assigned by the frontend, or the backend could also have
assigned the first set. Now we reuse to unused bits of the session flags to
mark which stick counters were assigned by the backend and to release them as
appropriate.
The assumption that there was a 1:1 relation between tracked counters and
the frontend/backend role was wrong. It is perfectly possible to track the
track-fe-counters from the backend and the track-be-counters from the
frontend. Thus, in order to reduce confusion, let's remove this useless
{fe,be} reference and simply use {1,2} instead. The keywords have also been
renamed in order to limit confusion. The ACL rule action now becomes
"track-sc{1,2}". The ACLs are now "sc{1,2}_*" instead of "trk{fe,be}_*".
That means that we can reasonably document "sc1" and "sc2" (sticky counters
1 and 2) as sort of patterns that are available during the whole session's
life and use them just like any other pattern.
Having a single tracking pointer for both frontend and backend counters
does not work. Instead let's have one for each. The keyword has changed
to "track-be-counters" and "track-fe-counters", and the ACL "trk_*"
changed to "trkfe_*" and "trkbe_*".
It is now possible to dump some select table entries based on criteria
which apply to the stored data. This is enabled by appending the following
options to the end of the "show table" statement :
data.<data_type> {eq|ne|lt|gt|le|ge} <value>
For intance :
show table http_proxy data.conn_rate gt 5
show table http_proxy data.gpc0 ne 0
The compare applies to the integer value as it would be displayed, and
operates on signed long long integers.
It's a bit cumbersome to have to know all possible storable types
from the stats interface. Instead, let's have generic types for
all data, which will facilitate their manipulation.
It is now possible to dump a table's contents with keys, expire,
use count, and various data using the command above on the stats
socket.
"show table" only shows main table stats, while "show table <name>"
dumps table contents, only if the socket level is admin.
This patch adds support for the following session counters :
- http_req_cnt : HTTP request count
- http_req_rate: HTTP request rate
- http_err_cnt : HTTP request error count
- http_err_rate: HTTP request error rate
The equivalent ACLs have been added to check the tracked counters
for the current session or the counters of the current source.
This counter may be used to track anything. Two sets of ACLs are available
to manage it, one gets its value, and the other one increments its value
and returns it. In the second case, the entry is created if it did not
exist.
Thus it is possible for example to mark a source as being an abuser and
to keep it marked as long as it does not wait for the entry to expire :
# The rules below use gpc0 to track abusers, and reject them if
# a source has been marked as such. The track-counters statement
# automatically refreshes the entry which will not expire until a
# 1-minute silence is respected from the source. The second rule
# evaluates the second part if the first one is true, so GPC0 will
# be increased once the conn_rate is above 100/5s.
stick-table type ip size 200k expire 1m store conn_rate(5s),gpc0
tcp-request track-counters src
tcp-request reject if { trk_get_gpc0 gt 0 }
tcp-request reject if { trk_conn_rate gt 100 } { trk_inc_gpc0 gt 0}
Alternatively, it is possible to let the entry expire even in presence of
traffic by swapping the check for gpc0 and the track-counters statement :
stick-table type ip size 200k expire 1m store conn_rate(5s),gpc0
tcp-request reject if { src_get_gpc0 gt 0 }
tcp-request track-counters src
tcp-request reject if { trk_conn_rate gt 100 } { trk_inc_gpc0 gt 0}
It is also possible not to track counters at all, but entry lookups will
then be performed more often :
stick-table type ip size 200k expire 1m store conn_rate(5s),gpc0
tcp-request reject if { src_get_gpc0 gt 0 }
tcp-request reject if { src_conn_rate gt 100 } { src_inc_gpc0 gt 0}
The '0' at the end of the counter name is there because if we find that more
counters may be useful, other ones will be added.
This function looks up a key, updates its expiration date, or creates
it if it was not found. acl_fetch_src_updt_conn_cnt() was updated to
make use of it.
These counters maintain incoming and outgoing byte rates in a stick-table,
over a period which is defined in the configuration (2 ms to 24 days).
They can be used to detect service abuse and enforce a certain bandwidth
limits per source address for instance, and block if the rate is passed
over. Since 32-bit counters are used to compute the rates, it is important
not to use too long periods so that we don't have to deal with rates above
4 GB per period.
Example :
# block if more than 5 Megs retrieved in 30 seconds from a source.
stick-table type ip size 200k expire 1m store bytes_out_rate(30s)
tcp-request track-counters src
tcp-request reject if { trk_bytes_out_rate gt 5000000 }
# cause a 15 seconds pause to requests from sources in excess of 2 megs/30s
tcp-request inspect-delay 15s
tcp-request content accept if { trk_bytes_out_rate gt 2000000 } WAIT_END
These counters maintain incoming connection rates and session rates
in a stick-table, over a period which is defined in the configuration
(2 ms to 24 days). They can be used to detect service abuse and
enforce a certain accept rate per source address for instance, and
block if the rate is passed over.
Example :
# block if more than 50 requests per 5 seconds from a source.
stick-table type ip size 200k expire 1m store conn_rate(5s),sess_rate(5s)
tcp-request track-counters src
tcp-request reject if { trk_conn_rate gt 50 }
# cause a 3 seconds pause to requests from sources in excess of 20 requests/5s
tcp-request inspect-delay 3s
tcp-request content accept if { trk_sess_rate gt 20 } WAIT_END
We're now able to return errors based on the validity of an argument
passed to a stick-table store data type. We also support ARG_T_DELAY
to pass delays to stored data types (eg: for rate counters).
Some data types will require arguments (eg: period for a rate counter).
This patch adds support for such arguments between parenthesis in the
"store" directive of the stick-table statement. Right now only integers
are supported.
When a session tracks a counter, automatically increase the cumulated
connection count. This makes src_updt_conn_cnt() almost useless. In
fact it might still be used to update different tables.
The new "bytes_in_cnt" and "bytes_out_cnt" session counters have been
added. They're automatically updated when session counters are updated.
They can be matched with the "src_kbytes_in" and "src_kbytes_out" ACLs
which apply to the volume per source address. This can be used to deny
access to service abusers.
The new "conn_cur" session counter has been added. It is automatically
updated upon "track XXX" directives, and the entry is touched at the
moment we increment the value so that we don't consider further counter
updates as real updates, otherwise we would end up updating upon completion,
which may not be desired. Probably that some other event counters (eg: HTTP
requests) will have to be updated upon each event though.
This counter can be matched against current session's source address using
the "src_conn_cur" ACL.
The "_cnt" suffix is already used by ACLs to count various data,
so it makes sense to use the same one in "conn_cnt" instead of
"conn_cum" to count cumulated connections.
This is not a problem because no version was emitted with those
keywords.
Thus we'll try to stick to the following rules :
xxxx_cnt : cumulated event count for criterion xxxx
xxxx_cur : current number of concurrent entries for criterion xxxx
xxxx_rate: event rate for criterion xxxx
This patch adds the ability to set a pointer in the session to an
entry in a stick table which holds various counters related to a
specific pattern.
Right now the syntax matches the target syntax and only the "src"
pattern can be specified, to track counters related to the session's
IPv4 source address. There is a special function to extract it and
convert it to a key. But the goal is to be able to later support as
many patterns as for the stick rules, and get rid of the specific
function.
The "track-counters" directive may only be set in a "tcp-request"
statement right now. Only the first one applies. Probably that later
we'll support multi-criteria tracking for a single session and that
we'll have to name tracking pointers.
No counter is updated right now, only the refcount is. Some subsequent
patches will have to bring that feature.
The buffer_feed* functions that are used to send data to buffers did only
support sending contiguous chunks while they're relying on memcpy(). This
patch improves on this by making them able to write in two chunks if needed.
Thus, the buffer_almost_full() function has been improved to really consider
the remaining space and not just what can be written at once.
Sometimes it's necessary to be able to perform some "layer 6" analysis
in the backend. TCP request rules were not available till now, although
documented in the diagram. Enable them in backend now.
Some config parsing functions need to return composite status codes
when they rely on other functions. Let's provide a few such codes
for general use and extend them later.
Some freq counters will have to work on periods different from 1 second.
The original freq counters rely on the period to be exactly one second.
The new ones (freq_ctr_period) let the user define the period in ticks,
and all computations are operated over that period. When reading a value,
it indicates the amount of events over that period too.
We'll need to divide 64 bits by 32 bits with new frequency counters.
Gcc does not know when it can safely do that, but the way we build
our operations let us be sure. So let's provide an optimised version
for that purpose.
This member will be used later when frontends are created on the
fly by some tasks. It will also be usable later if we need to
support multiple config instances for example.
When a connection is closed on a stream interface, some iohandlers
will need to be informed in order to release some resources. This
normally happens upon a shutr+shutw. It is the equivalent of the
fd_delete() call which is done for real sockets, except that this
time we release internal resources.
It can also be used with real sockets because it does not cost
anything else and might one day be useful.
The quote_arg() function can be used to quote an argument or indicate
"end of line" if it's null or empty. It should be useful to more precisely
report location of problems in the configuration.
When an entry already exists, we just need to update its expiration
timer. Let's have a dedicated function for that instead of spreading
open code everywhere.
This change also ensures that an update of an existing sticky session
really leads to an update of its expiration timer, which was apparently
not the case till now. This point needs to be checked in 1.4.
Till now sticky sessions only held server IDs. Now there are other
data types so it is not acceptable anymore to overwrite the server ID
when writing something. The server ID must then only be written from
the caller when appropriate. Doing this has also led to separate
lookup and storage.
This one can be parsed on the "stick-table" after with the "store"
keyword. It will hold the number of connections matching the entry,
for use with ACLs or anything else.
The stick_tables will now be able to store extra data for a same key.
A limited set of extra data types will be defined and for each of them
an offset in the sticky session will be assigned at startup time. All
of this information will be stored in the stick table.
The extra data types will have to be specified after the new "store"
keyword of the "stick-table" directive, which will reserve some space
for them.
pattern.c depended on stick_table while in fact it should be the opposite.
So we move from pattern.c everything related to stick_tables and invert the
dependency. That way the code becomes more logical and intuitive.
The name 'exps' and 'keys' in struct stksess was confusing because it was
the same name as in the table which holds all of them, while they only hold
one node each. Remove the trailing 's' to more clearly identify who's who.
Right now we're only able to store a server ID in a sticky session.
The goal is to be able to store anything whose size is known at startup
time. For this, we store the extra data before the stksess pointer,
using a negative offset. It will then be easy to cumulate multiple
data provided they each have their own offset.
It's very disturbing to see the "denied req" counter increase without
any other session counter moving. In fact, we can't count a rejected
TCP connection as "denied req" as we have not yet instanciated any
session at all. Let's use a new counter for that.
Now we're able to reject connections very early, so we need to use a
different counter for the connections that are received and the ones
that are accepted and converted into sessions, so that the rate limits
can still apply to the accepted ones. The session rate must still be
used to compute the rate limit, so that we can reject undesired traffic
without affecting the rate.
Analysers don't care (and must not care) about a few flags such as
BF_AUTO_CLOSE or BF_AUTO_CONNECT, so those flags should not be listed
in the BF_MASK_STATIC bitmask.
We should also recheck if some buffer flags should be ignored or not
in process_session() when deciding if we must loop again or not.
A new function session_accept() is now called from the lower layer to
instanciate a new session. Once the session is instanciated, the upper
layer's frontent_accept() is called. This one can be service-dependant.
That way, we have a 3-phase accept() sequence :
1) protocol-specific, session-less accept(), which is pointed to by
the listener. It defaults to the generic stream_sock_accept().
2) session_accept() which relies on a frontend but not necessarily
for use in a proxy (eg: stats or any future service).
3) frontend_accept() which performs the accept for the service
offerred by the frontend. It defaults to frontend_accept() which
is really what is used by a proxy.
The TCP/HTTP proxies have been moved to this mode so that we can now rely on
frontend_accept() for any type of session initialization relying on a frontend.
The next step will be to convert the stats to use the same system for the stats.
The conn_retries still lies in the session and its initialization depends
on the backend when it may not yet be known. Let's first move it to the
stream interface.
It's not normal to initialize the server-side stream interface from the
accept() function, because it may change later. Thus, we introduce a new
stream_sock_prepare_interface() function which is called just before the
connect() and which sets all of the stream_interface's callbacks to the
default ones used for real sockets. The ->connect function is also set
at the same instant so that we can easily add new server-side protocols
soon.
The connection timeout stored in the buffer has not been used since the
stream interface were introduced. Let's get rid of it as it's one of the
things that complicate factoring of the accept() functions.
We can disable the monitor-net rules on a listener if this flag is not
set in the listener's options. This will be useful when we don't want
to check that fe->addr is set or not for non-TCP frontends.
The new LI_O_TCP_RULES listener option indicates that some TCP rules
must be checked upon accept on this listener. It is now checked by
the frontend and the L4 rules are evaluated only in this case. The
flag is only set when at least one tcp-req rule is present in the
frontend.
The L4 rules check function has now been moved to proto_tcp.c where
it ought to be.
For a long time we had two large accept() functions, one for TCP
sockets instanciating proxies, and another one for UNIX sockets
instanciating the stats interface.
A lot of code was duplicated and both did not work exactly the same way.
Now we have a stream_sock layer accept() called for either TCP or UNIX
sockets, and this function calls the frontend-specific accept() function
which does the rest of the frontend-specific initialisation.
Some code is still duplicated (session & task allocation, stream interface
initialization), and might benefit from having an intermediate session-level
accept() callback to perform such initializations. Still there are some
minor differences that need to be addressed first. For instance, the monitor
nets should only be checked for proxies and not for other connection templates.
Last, we renamed l->private as l->frontend. The "private" pointer in
the listener is only used to store a frontend, so let's rename it to
eliminate this ambiguity. When we later support detached listeners
(eg: FTP), we'll add another field to avoid the confusion.
The 'client.c' file now only contained frontend-specific functions,
so it has naturally be renamed 'frontend.c'. Same for client.h. This
has also been an opportunity to remove some cross references from
files that should not have depended on it.
In the end, this file should contain a protocol-agnostic accept()
code, which would initialize a session, task, etc... based on an
accept() from a lower layer. Right now there are still references
to TCP.
Some functions which act on generic buffer contents without being
tcp-specific were historically in proto_tcp.c. This concerns ACLs
and RDP cookies. Those have been moved away to more appropriate
locations. Ideally we should create some new files for each layer6
protocol parser. Let's do that later.
Just like we do on health checks, we should consider that ACLs that make
use of buffer data are layer 6 and not layer 4, because we'll soon have
to distinguish between pure layer 4 ACLs (without any buffer) and these
ones.
This ACL was missing in complex setups where the status of a remote site
has to be considered in switching decisions. Until there, using a server's
status in an ACL required to have a dedicated backend, which is a bit heavy
when multiple servers have to be monitored.
The code is now ready to support loading pattern from filesinto trees. For
that, it will be required that the ACL keyword has a flag ACL_MAY_LOOKUP
and that the expr is case sensitive. When that is true, the pattern will
have a flag ACL_PAT_F_TREE_OK to indicate that it is possible to feed the
tree instead of a usual pattern if the parsing function is able to do this.
The tree's root is pre-initialized in the pattern's value so that the
function can easily find it. At that point, if the parsing function decides
to use the tree, it just sets ACL_PAT_F_TREE in the return flags so that
the caller knows the tree has been used and the pattern can be recycled.
That way it will be possible to load some patterns into the tree when it
is compatible, and other ones as linear linked lists. A good example of
this might be IPv4 network entries : right now we support holes in masks,
but this very rare feature is not compatible with binary lookup in trees.
So the parser will be able to decide itself whether the pattern can go to
the tree or not.
If we want to be able to match ACLs against a lot of possible values, we
need to put those values in trees. That will only work for exact matches,
which is normally just what is needed.
Right now, only IPv4 and string matching are planned, but others might come
later.
This is used to disable persistence depending on some conditions (for
example using an ACL matching static files or a specific User-Agent).
You can see it as a complement to "force-persist".
In the configuration file, the force-persist/ignore-persist declaration
order define the rules priority.
Used with the "appsesion" keyword, it can also help reducing memory usage,
as the session won't be hashed the persistence is ignored.
Some servers do not completely conform with RFC2616 requirements for
keep-alive when they receive a request with "Connection: close". More
specifically, they don't bother using chunked encoding, so the client
never knows whether the response is complete or not. One immediately
visible effect is that haproxy cannot maintain client connections alive.
The second issue is that truncated responses may be cached on clients
in case of network error or timeout.
scar Fras Barranco reported this issue on Tomcat 6.0.20, and
Patrik Nilsson with Jetty 6.1.21.
Cyril Bont proposed this smart idea of pretending we run keep-alive
with the server and closing it at the last moment as is already done
with option forceclose. The advantage is that we only change one
emitted header but not the overall behaviour.
Since some servers such as nginx are able to close the connection
very quickly and save network packets when they're aware of the
close negociation in advance, we don't enable this behaviour by
default.
"option http-pretend-keepalive" will have to be used for that, in
conjunction with "option http-server-close".
Using get_ip_from_hdr2() we can look for occurrence #X or #-X and
extract the IP it contains. This is typically designed for use with
the X-Forwarded-For header.
Using "usesrc hdr_ip(name,occ)", it becomes possible to use the IP address
found in <name>, and possibly specify occurrence number <occ>, as the
source to connect to a server. This is possible both in a server and in
a backend's source statement. This is typically used to use the source
IP previously set by a upstream proxy.
The transparent proxy address selection was set in the TCP connect function
which is not the most appropriate place since this function has limited
access to the amount of parameters which could produce a source address.
Instead, now we determine the source address in backend.c:connect_server(),
right after calling assign_server_address() and we assign this address in
the session and pass it to the TCP connect function. This cannot be performed
in assign_server_address() itself because in some cases (transparent mode,
dispatch mode or http_proxy mode), we assign the address somewhere else.
This change will open the ability to bind to addresses extracted from many
other criteria (eg: from a header).
We'll need another flag in the 'options' member close to PR_O_TPXY_*,
and all are used, so let's move this easy one to options2 (which are
already used for SQL checks).
The following patch fixed an issue but brought another one :
296897 [MEDIUM] connect to servers even when the input has already been closed
The new issue is that when a connection is inspected and aborted using
TCP inspect rules, now it is sent to the server before being closed. So
that test is not satisfying. A probably better way is not to prevent a
connection from establishing if only BF_SHUTW_NOW is set but BF_SHUTW
is not. That way, the BF_SHUTW flag is not set if the request has any
data pending, which still fixes the stats issue, but does not let any
empty connection pass through.
Also, as a safety measure, we extend buffer_abort() to automatically
disable the BF_AUTO_CONNECT flag. While it appears to always be OK,
it is by pure luck, so better safe than sorry.
We are seeing both real servers repeatedly going on- and off-line with
a period of tens of seconds. Packet tracing, stracing, and adding
debug code to HAProxy itself has revealed that the real servers are
always responding correctly, but HAProxy is sometimes receiving only
part of the response.
It appears that the real servers are sending the test page as three
separate packets. HAProxy receives the contents of one, two, or three
packets, apparently randomly. Naturally, the health check only
succeeds when all three packets' data are seen by HAProxy. If HAProxy
and the real servers are modified to use a plain HTML page for the
health check, the response is in the form of a single packet and the
checks do not fail.
(...)
I've added buffer and length variables to struct server, and allocated
space with the rest of the server initialisation.
(...)
It seems to be working fine in my tests, and handles check responses
that are bigger than the buffer.
today I've noticed that the stats page still displays v1.3 in the
"Updates" link, due to the PRODUCT_BRANCH value in version.h, then
it's maybe time to send you the result (notice that the patch updates
PRODUCT_BRANCH to "1.4").
--
Cyril Bont
When trying to spot some complex bugs, it's often needed to access
information on stuck sessions, which is quite difficult. This new
command helps one get detailed information about a session, with
flags, timers, states, etc... The buffer data are not dumped yet.
Often we need to understand why some transfers were aborted or what
constitutes server response errors. With those two counters, it is
now possible to detect an unexpected transfer abort during a data
phase (eg: too short HTTP response), and to know what part of the
server response errors may in fact be assigned to aborted transfers.
The bounce realign function was algorithmically good but as expected
it was not cache-friendly. Using it with large requests caused so many
cache thrashing that the function itself could drain 70% of the total
CPU time for only 0.5% of the calls !
Revert back to a standard memcpy() using a specially allocated swap
buffer. We're now back to 2M req/s on pipelined requests.
It is wrong to merge FE and BE stats for a proxy because when we consult a
BE's stats, it reflects the FE's stats eventhough the BE has received no
traffic. The most common example happens with listen instances, where the
backend gets credited for all the trafic even when a use_backend rule makes
use of another backend.
The trash buffer may now be smaller than a buffer because we can tune
it at run time. This causes a risk when we're trying to use it as a
temporary buffer to realign unaligned requests, because we may have to
put up to a full buffer into it.
Instead of doing a double copy, we're now relying on an open-coded
bouncing copy algorithm. The principle is that we move one byte at
a time to its final place, and if that place also holds a byte, then
we move it too, and so on. We finish when we've moved all the buffer.
It limits the number of memory accesses, but since it proceeds one
byte at a time and with random walk, it's not cache friendly and
should be slower than a double copy. However, it's only used in
extreme situations and the difference will not be noticeable.
It has been extensively tested and works reliably.
This is a first attempt to add a maintenance mode on servers, using
the stat socket (in admin level).
It can be done with the following command :
- disable server <backend>/<server>
- enable server <backend>/<server>
In this mode, no more checks will be performed on the server and it
will be marked as a special DOWN state (MAINT).
If some servers were tracking it, they'll go DOWN until the server
leaves the maintenance mode. The stats page and the CSV export also
display this special state.
This can be used to disable the server in haproxy before doing some
operations on this server itself. This is a good complement to the
"http-check disable-on-404" keyword and works in TCP mode.
Support the new syntax (http-request allow/deny/auth) in
http stats.
Now it is possible to use the same syntax is the same like in
the frontend/backend http-request access control:
acl src_nagios src 192.168.66.66
acl stats_auth_ok http_auth(L1)
stats http-request allow if src_nagios
stats http-request allow if stats_auth_ok
stats http-request auth realm LB
The old syntax is still supported, but now it is emulated
via private acls and an aditional userlist.
Add generic authentication & authorization support.
Groups are implemented as bitmaps so the count is limited to
sizeof(int)*8 == 32.
Encrypted passwords are supported with libcrypt and crypt(3), so it is
possible to use any method supported by your system. For example modern
Linux/glibc instalations support MD5/SHA-256/SHA-512 and of course classic,
DES-based encryption.
Implement Base64 decoding with a reverse table.
The function accepts and decodes classic base64 strings, which
can be composed from many streams as long each one is properly
padded, for example: SGVsbG8=IEhBUHJveHk=IQ==
Just as for the req* rules, we can now condition rsp* rules with ACLs.
ACLs match on response, so volatile request information cannot be used.
A warning is emitted if a configuration contains such an anomaly.
From now on, if request filters have ACLs defined, these ACLs will be
evaluated to condition the filter. This will be used to conditionally
remove/rewrite headers based on ACLs.
This function automatically builds a rule, considering the if/unless
statements, and automatically updates the proxy's acl_requires, the
condition's file and line.
Now a server can check the contents of the header X-Haproxy-Server-State
to know how haproxy sees it. The same values as those reported in the stats
are provided :
- up/down status + check counts
- throttle
- weight vs backend weight
- active sessions vs backend sessions
- queue length
- haproxy node name
Currently we cannot easily add headers nor anything to HTTP checks
because the requests are pre-formatted with the last CRLF. Make the
check code add the CRLF itself so that we can later add useful info.
Some converters will need one or several arguments. It's not possible
to write a simple generic parser for that, so let's add the ability
for each converter to support its own argument parser, and call it
to get the arguments when it's specified. If unspecified, the arguments
are passed unmodified as string+len.
The pattern type converters currently support a string arg and a length.
Sometimes we'll prefer to pass them a list or a structure. So let's convert
the string and length into a generic void* and int that each converter may
use as it likes.
Despite what is explicitly stated in HTTP specifications,
browsers still use the undocumented Proxy-Connection header
instead of the Connection header when they connect through
a proxy. As such, proxies generally implement support for
this stupid header name, breaking the standards and making
it harder to support keep-alive between clients and proxies.
Thus, we add a new "option http-use-proxy-header" to tell
haproxy that if it sees requests which look like proxy
requests, it should use the Proxy-Connection header instead
of the Connection header.
This is used to force access to down servers for some requests. This
is useful when validating that a change on a server correctly works
before enabling the server again.
Sometimes we need to be able to change the default kernel socket
buffer size (recv and send). Four new global settings have been
added for this :
- tune.rcvbuf.client
- tune.rcvbuf.server
- tune.sndbuf.client
- tune.sndbuf.server
Those can be used to reduce kernel memory footprint with large numbers
of concurrent connections, and to reduce risks of write timeouts with
very slow clients due to excessive kernel buffering.
We need to improve Connection header handling in the request for it
to support the upcoming keep-alive mode. Now we have two flags which
keep in the session the information about the presence of a
Connection: close and a Connection: keep-alive headers in the initial
request, as well as two others which keep the current state of those
headers so that we don't have to parse them again. Knowing the initial
value is essential to know when the client asked for keep-alive while
we're forcing a close (eg in server-close mode). Also the Connection
request parser is now able to automatically remove single header values
at the same time they are parsed. This provides greater flexibility and
reliability.
All combinations of listen/front/back in all modes and with both
1.0 and 1.1 have been tested.
Some header values might be delimited with spaces, so it's not enough to
compare "close" or "keep-alive" with strncasecmp(). Use word_match() for
that.
Calling this function after http_find_header2() automatically deletes
the current value of the header, and removes the header itself if the
value is the only one. The context is automatically adjusted for a
next call to http_find_header2() to return the next header. No other
change nor test should be made on the transient context though.
While waiting in a keep-alive state for a request, we want to silently
close if we don't get anything. However if we get a partial request it's
different because that means the client has started to send something.
This requires a new transaction flag. It will be used to implement a
distinct timeout for keep-alive and requests.
This change, suggested by Cyril Bont, makes a lot of sense and
would have made it obvious that sessid was not properly initialized
while switching to keep-alive. The code is now cleaner.
The stream_int_cond_close() function was added to preserve the
contents of the response buffer because stream_int_retnclose()
was buggy. It flushed the response instead of flushing the
request. This caused issues with pipelined redirects followed
by error messages which ate the previous response.
This might even have caused object truncation on pipelined
requests followed by an error or by a server redirection.
Now that this is fixed, simply get rid of the now useless
function.
Sometimes it can be desired to return a location which is the same
as the request with a slash appended when there was not one in the
request. A typical use of this is for sending a 301 so that people
don't reference links without the trailing slash. The name of the
new option is "append-slash" and it can be used on "redirect"
statements in prefix mode.
Some message pointers were not usable once the message reached the
HTTP_MSG_DONE state. This is the case for ->som which points to the
body because it is needed to parse chunks. There is one case where
we need the beginning of the message : server redirect. We have to
call http_get_path() after the request has been parsed. So we rely
on ->sol without counting on ->som. In order to achieve this, we're
making ->rq.{u,v} relative to the beginning of the message instead
of the buffer. That simplifies the code and makes it cleaner.
Preliminary tests show this is OK.
Several HTTP analysers used to set those flags to values that
were useful but without considering the possibility that they
were not called again to clean what they did. First, replace
direct flag manipulation with more explicit macros. Second,
enforce a rule stating that any buffer which changes one of
these flags from the default must restore it after completion,
so that other analysers see correct flags.
With both this fix and the previous one about analyser bits,
we should not see any more stuck sessions.
This patch implements default-server support allowing to change
default server options. It can be used in [defaults] or [backend]/[listen]
sections. Currently the following options are supported:
- error-limit
- fall
- inter
- fastinter
- downinter
- maxconn
- maxqueue
- minconn
- on-error
- port
- rise
- slowstart
- weight
Supported informations, available via "tr/td title":
- cap: capabilities (proxy)
- mode: one of tcp, http or health (proxy)
- id: SNMP ID (proxy, socket, server)
- IP (socket, server)
- cookie (backend, server)
There were still several situations leading to CLOSE_WAIT sockets
remaining there forever because some complex transitions were
obviously not caught due to the impossibility to resync changes
between the request and response FSMs.
This patch now centralizes the global transaction state and feeds
it from both request and response transitions. That way, whoever
finishes first, there will be no issue for converging to the correct
state.
Some heavy use of the new debugging function has helped a lot. Maybe
those calls could be removed after some time. First tests are very
positive.
By default we automatically wait for enough data to fill large
packets if buf->to_forward is not null. This causes a problem
with POST/Expect requests which have a data size but no data
immediately available. Instead of causing noticeable delays on
such requests, simply add a flag to disable waiting when sending
requests.
Many times we see a lot of short responses in HTTP (typically 304 on a
reload). It is a waste of network bandwidth to send that many small packets
when we know we can merge them. When we know that another HTTP request is
following a response, we set BF_EXPECT_MORE on the response buffer, which
will turn MSG_MORE on exactly once. That way, multiple short responses can
leave pipelined if their corresponding requests were also pipelined.
This option enables HTTP keep-alive on the client side and close mode
on the server side. This offers the best latency on the slow client
side, and still saves as many resources as possible on the server side
by actively closing connections. Pipelining is supported on both requests
and responses, though there is currently no reason to get pipelined
responses.
This new flag may be set by any user on a stream interface to tell
the underlying protocol that there is no need for lingering on the
socket since we know the other side either received everything or
does not care about what we sent.
This will typically be used with forced server close in HTTP mode,
where we want to quickly close a server connection after receiving
its response. Otherwise the system would prevent us from reusing
the same port for some time.
The body parser will be used in close and keep-alive modes. It follows
the stream to keep in sync with both the request and the response message.
Both chunked transfer-coding and content-length are supported according to
RFC2616.
The multipart/byterange encoding has not yet been implemented and if not
seconded by any of the two other ones, will be forwarded till the close,
as requested by the specification.
Both the request and the response analysers converge into an HTTP_MSG_DONE
state where it will be possible to force a close (option forceclose) or to
restart with a fresh new transaction and maintain keep-alive.
This change is important. All tests are OK but any possible behaviour
change with "option httpclose" might find its root here.
Some wrong operations were performed on buffers, assuming the
offsets were relative to the beginning of the request while they
are relative to the beginning of the buffer. In practice this is
not yet an issue since both are the same... until we add support
for keep-alive.
It's not enough to know if the connection will be in CLOSE or TUNNEL mode,
we still need to know whether we want to read a full message to a known
length or read it till the end just as in TUNNEL mode. Some updates to the
RFC clarify slightly better the corner cases, in particular for the case
where a non-chunked encoding is used last.
Now we also take care of adding a proper "connection: close" to messages
whose size could not be determined.
This state indicates that an HTTP message (request or response) is
complete. This will be used to know when we can re-initialize a
new transaction. Right now we only switch to it after the end of
headers if there is no data. When other analysers are implemented,
we can switch to this state too.
The condition to reuse a connection is when the response finishes
after the request. This will have to be checked when setting the
state.
This code really belongs to the http part since it's transaction-specific.
This will also make it easier to later reinitialize a transaction in order
to support keepalive.
We used to apply a limit to each buffer's size in order to leave
some room to rewrite headers, then we used to remove this limit
once the session switched to a data state.
Proceeding that way becomes a problem with keepalive because we
have to know when to stop reading too much data into the buffer
so that we can leave some room again to process next requests.
The principle we adopt here consists in only relying on to_forward+send_max.
Indeed, both of those data define how many bytes will leave the buffer.
So as long as their sum is larger than maxrewrite, we can safely
fill the buffers. If they are smaller, then we refrain from filling
the buffer. This means that we won't risk to fill buffers when
reading last data chunk followed by a POST request and its contents.
The only impact identified so far is that we must ensure that the
BF_FULL flag is correctly dropped when starting to forward. Right
now this is OK because nobody inflates to_forward without using
buffer_forward().
Up to now, we only had a flag in the session indicating if it had to
work in "connection: close" mode. This is not at all compatible with
keep-alive.
Now we ensure that both sides of a connection act independantly and
only relative to the transaction. The HTTP version of the request
and response is also correctly considered. The connection already
knows several modes :
- tunnel (CONNECT or no option in the config)
- keep-alive (when permitted by configuration)
- server-close (close the server side, not the client)
- close (close both sides)
This change carefully detects all situations to find whether a request
can be fully processed in its mode according to the configuration. Then
the response is also checked and tested to fix corner cases which can
happen with different HTTP versions on both sides (eg: a 1.0 client
asks for explicit keep-alive, and the server responds with 1.1 without
a header).
The mode is selected by a capability elimination algorithm which
automatically focuses on the least capable agent between the client,
the frontend, the backend and the server. This ensures we won't get
undesired situtations where one of the 4 "agents" is not able to
process a transaction.
No "Connection: close" header will be added anymore to HTTP/1.0 requests
or responses since they're already in close mode.
The server-close mode is still not completely implemented. The response
needs to be rewritten as keep-alive before being sent to the client if
the connection was already in server-close (which implies the request
was in keep-alive) and if the response has a content-length or a
transfer-encoding (but only if client supports 1.1).
A later improvement in server-close mode would probably be to detect
some situations where it's interesting to close the response (eg:
redirections with remote locations). But even then, the client might
close by itself.
It's also worth noting that in tunnel mode, no connection header is
affected in either direction. A tunnelled connection should theorically
be notified at the session level, but this is useless since by definition
there will not be any more requests on it. Thus, we don't need to add a
flag into the session right now.
Implement decreasing health based on observing communication between
HAProxy and servers.
Changes in this version 2:
- documentation
- close race between a started check and health analysis event
- don't force fastinter if it is not set
- better names for options
- layer4 support
Changes in this version 3:
- add stats
- port to the current 1.4 tree
In order to support keepalive, we'll have to differentiate
normal sessions from tunnel sessions, which are the ones we
don't want to analyse further.
Those are typically the CONNECT requests where we don't care
about any form of content-length, as well as the requests
which are forwarded on non-close and non-keepalive proxies.
To sum up :
- len : it's now the max number of characters for the value, preventing
garbaged results.
- a new option "prefix" is added, this allows to use dynamic cookie
names (e.g. ASPSESSIONIDXXX).
Previously in the thread, I wanted to use the value found with
"capture cookie" but when i started to update the documentation, I
found this solution quite weird. I've made a small rework to not
depend on "capture cookie".
- There's the posssiblity to define the URL parser mode (path parameters
or query string).
We now set msg->col and msg->sov to the first byte of non-header.
They will be used later when parsing chunks. A new macro was added
to perform size additions on an http_msg in order to limit the risks
of copy-paste in the long term.
During this operation, it appeared that the http_msg struct was not
optimal on 64-bit, so it was re-ordered to fill the holes.
An HTTP message can be decomposed into several sub-states depending
on the transfer-encoding. We'll have to keep these state information
while parsing chunks, so we must extend the values. In order not to
change everything, we'll now consider that anything >= MSG_BODY is
the body, and that the value indicates the precise state. The
MSG_ERROR status which was greater than MSG_BODY was moved for this.
Right now, an HTTP server cannot track a TCP server and vice-versa.
This patch enables proxy tracking without relying on the proxy's mode
(tcp/http/health). It only requires a matching proxy name to exist. The
original function was renamed to findproxy_mode().
It's a pain to enable regparm because ebtree is built in its corner
and does not depend on the rest of the config. This causes no problem
except that if the regparm settings are not exactly similar, then we
can get inconsistent function interfaces and crashes.
One solution realized in this patch consists in externalizing all
compiler settings and changing CONFIG_XXX_REGPARM into CONFIG_REGPARM
so that we ensure that any sub-component uses the same setting. Since
ebtree used a value here and not a boolean, haproxy's config has been
set to use a number too. Both haproxy's core and ebtree currently use
the same copy of the compiler.h file. That way we don't have any issue
anymore when one setting changes somewhere.
All files referencing the previous ebtree code were changed to point
to the new one in the ebtree directory. A makefile variable (EBTREE_DIR)
is also available to use files from another directory.
The ability to build the libebtree library temporarily remains disabled
because it can have an impact on some existing toolchains and does not
appear worth it in the medium term if we add support for multi-criteria
stickiness for instance.
The code part which waits for an HTTP response has been extracted
from the old function. We now have two analysers and the second one
may re-enable the first one when an 1xx response is encountered.
This has been tested and works.
The calls to stream_int_return() that were remaining in the wait
analyser have been converted to stream_int_retnclose().
This patch has 2 goals :
1. I wanted to test the appsession feature with a small PHP code,
using PHPSESSID. The problem is that when PHP gets an unknown session
id, it creates a new one with this ID. So, when sending an unknown
session to PHP, persistance is broken : haproxy won't see any new
cookie in the response and will never attach this session to a
specific server.
This also happens when you restart haproxy : the internal hash becomes
empty and all sessions loose their persistance (load balancing the
requests on all backend servers, creating a new session on each one).
For a user, it's like the service is unusable.
The patch modifies the code to make haproxy also learn the persistance
from the client : if no session is sent from the server, then the
session id found in the client part (using the URI or the client cookie)
is used to associated the server that gave the response.
As it's probably not a feature usable in all cases, I added an option
to enable it (by default it's disabled). The syntax of appsession becomes :
appsession <cookie> len <length> timeout <holdtime> [request-learn]
This helps haproxy repair the persistance (with the risk of losing its
session at the next request, as the user will probably not be load
balanced to the same server the first time).
2. This patch also tries to reduce the memory usage.
Here is a little example to explain the current behaviour :
- Take a Tomcat server where /session.jsp is valid.
- Send a request using a cookie with an unknown value AND a path
parameter with another unknown value :
curl -b "JSESSIONID=12345678901234567890123456789012" http://<haproxy>/session.jsp;jsessionid=00000000000000000000000000000001
(I know, it's unexpected to have a request like that on a live service)
Here, haproxy finds the URI session ID and stores it in its internal
hash (with no server associated). But it also finds the cookie session
ID and stores it again.
- As a result, session.jsp sends a new session ID also stored in the
internal hash, with a server associated.
=> For 1 request, haproxy has stored 3 entries, with only 1 which will be usable
The patch modifies the behaviour to store only 1 entry (maximum).
When processing a GET or HEAD request in close mode, we know we don't
need to read anything anymore on the socket, so we can disable it.
Doing this can save up to 40% of the recv calls, and half of the
epoll_ctl calls.
For this we need a buffer flag indicating that we're not interesting in
reading anymore. Right now, this flag also disables both polled reads.
We might benefit from disabling only speculative reads, but we will need
at least this flag when we want to support keepalive anyway.
Currently we don't disable the flag on completion, but it does not
matter as we close ASAP when performing the shutw().
The fd_list[] used by sepoll was indexed on the fd number and was only
used to store the equivalent of an integer. Changing it to be merged
with fdtab reduces the number of pointer computations, the code size
and some initialization steps. It does not harm other pollers much
either, as only one integer was added to the fdtab array.
Some rarely information are stored in fdtab, making it larger for no
reason (source port ranges, remote address, ...). Such information
lie there because the checks can't find them anywhere else. The goal
will be to move these information to the stream interface once the
checks make use of it.
For now, we move them to an fdinfo array. This simple change might
have improved the cache hit ratio a little bit because a 0.5% of
performance increase has measured.
This can ensure that data is readily available on a socket when
we accept it, but a bug in the kernel ignores the timeout so the
socket can remain pending as long as the client does not talk.
Use with care.
This alone makes a typical HTML stats dump consume 10% CPU less,
because we avoid doing complex printf calls to drop them later.
Only a few common cases have been checked, those which are very
likely to run for nothing.
It is a bit expensive and complex to use to call buffer_feed()
directly from the request parser, and there are risks that some
output messages are lost in case of buffer full. Since most of
these messages are static, let's have a state dedicated to print
these messages and store them in a specific area shared with the
stats in the session. This both reduces code size and risks of
losing output data.
Capture & display more data from health checks, like
strerror(errno) for L4 failed checks or a first line
from a response for L7 successes/failed checks.
Non ascii or control characters are masked with
chunk_htmlencode() (html stats) or chunk_asciiencode() (logs).
Add two functions to encode input chunk replacing
non-printable, non ascii or special characters
with:
"&#%u;" - chunk_htmlencode
"<%02X>" - chunk_asciiencode
Above functions should be used when adding strings, received
from possible unsafe sources, to html stats or logs.
int get_backend_server(const char *bk_name, const char *sv_name,
struct proxy **bk, struct server **sv);
This function scans the list of backends and servers to retrieve the first
backend and the first server with the given names, and sets them in both
parameters. It returns zero if either is not found, or non-zero and sets
the ones it did not found to NULL. If a NULL pointer is passed for the
backend, only the pointer to the server will be updated.
The stats socket can now run at 3 different levels :
- user
- operator (default one)
- admin
These levels are used to restrict access to some information
and commands. Only the admin can clear all stats. A user cannot
clear anything nor access sensible data such as sessions or
errors.
Consistent hashing provides some interesting advantages over common
hashing. It avoids full redistribution in case of a server failure,
or when expanding the farm. This has a cost however, the hashing is
far from being perfect, as we associate a server to a request by
searching the server with the closest key in a tree. Since servers
appear multiple times based on their weights, it is recommended to
use weights larger than approximately 10-20 in order to smoothen
the distribution a bit.
In some cases, playing with weights will be the only solution to
make a server appear more often and increase chances of being picked,
so stats are very important with consistent hashing.
In order to indicate the type of hashing, use :
hash-type map-based (default, old one)
hash-type consistent (new one)
Consistent hashing can make sense in a cache farm, in order not
to redistribute everyone when a cache changes state. It could also
probably be used for long sessions such as terminal sessions, though
that has not be attempted yet.
More details on this method of hashing here :
http://www.spiteful.com/2008/03/17/programmers-toolbox-part-3-consistent-hashing/
Recent "struct chunk rework" introduced a NULL pointer dereference
and now haproxy segfaults if auth is required for stats but not found.
The reason is that size_t cannot store negative values, but current
code assumes that "len < 0" == uninitialized.
This patch fixes it.
There are a few remaining max values that need to move to counters.
Also, the counters are more often used than some config information,
so get them closer to the other useful struct members for better cache
efficiency.