State and offsets within http_msg were incorrectly declared as signed int.
Turning them into unsigned slightly improved performance while reducing
code size.
Now when a server has "redir <prefix>" on its config line, any HEAD or GET
request addressing it will lead to a 302 with Location set to "<prefix>"
immediately followed by the relative URI of the incoming request. This makes
it very easy to send redirects to browsers towards remote static servers, as
well as to provide redirection for remote sites when the local one is down.
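For example (server address and prefix illustrative):
server img1 192.168.1.10:80 redir http://images.example.com check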
The servers now support the "redir" keyword, making it possible to
return a 302 with the specified prefix in front of the request instead
of connecting to them. This is generally useful for multi-site load
balancing but may also serve in order to achieve very high traffic
rate.
The keyword has only been added to the config parser and to structures,
it's not used yet.
This patch adds two new variables: fastinter and downinter.
When the server state is:
- non-transitionally UP -> inter (no change)
- transitionally UP (going down), unchecked or transitionally DOWN (going up) -> fastinter
- DOWN -> downinter
It allows settings such as:
server sr6 127.0.51.61:80 cookie s6 check inter 10000 downinter 20000 fastinter 500 fall 3 weight 40
In the above example haproxy uses 10000ms between checks, but as soon as
one check fails, fastinter (500ms) is used. If the server is down,
downinter (20000ms) is used, or fastinter (500ms) once a check passes.
Fastinter is also used when haproxy starts.
New "timeout.check" variable was added, if set haproxy uses it as an additional
read timeout, but only after a connection has been already established. I was
thinking about using "timeout.server" here but most people set this
with an addition reserve but still want checks to kick out laggy servers.
Please also note that in most cases check request is much simpler
and faster to handle than normal requests so this timeout should be smaller.
I also changed the timeout used for check connections establishing.
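For example (values illustrative, using the "timeout check" spelling from the list below):
defaults
    timeout connect 5000
    timeout check   2000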
Changes from the previous version:
- use tv_isset() to check if the timeout is set,
- use min("timeout connect", "inter") but only if "timeout check" is set
as this min alone may be to short for full (connect + read) check,
- debug code (fprintf) commented/removed
- documentation
Compile tested only (sorry!) as I'm currently traveling but changes
are rather small and trivial.
Commit 8b3977ffe3 removed "t->logs.bytes_in = 0;"
but it should instead have changed it into "t->logs.bytes_out = 0;", as since
583bc96606 counters are incremented, not set.
It should be incremented in session_process_counters while sending data to a
client:
bytes = s->rep->total - s->logs.bytes_out;
s->logs.bytes_out = s->rep->total;
However, if we increment (set) s->logs.bytes_out while processing
"logasap", statistics get wrong values added for headers: 0 or even
negative if haproxy adds some headers itself.
To test it, please enable logasap, download one empty file and look at the
stats. Without my fix, the information available on that page is invalid,
for example:
# pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,
www,b,0,0,0,1,,1,24,-92,,0,,0,0,0,,UP,1,1,0,0,0,3121,0,,1,2,1,,1,
www,BACKEND,0,0,0,1,0,1,24,-92,0,0,,0,0,0,0,UP,1,1,0,,0,3121,0,,1,2,0,,1,
There's no point trying to check the original dest addr with only one
method when doing transparent proxy, as in full transparent mode
the real destination address is required. Let's copy the one from
the frontend.
Due to the way Linux delivers EPOLLIN and EPOLLHUP, a closed connection
received after some server data sometimes results in truncated responses
if the client disconnects before server starts to respond. The reason
is that the EPOLLHUP flag is processed as an indication of end of
transfer while some data may remain in the system's socket buffers.
This problem could only be triggered with sepoll, although nothing should
prevent it from happening with normal epoll. In fact, the work factoring
performed by sepoll increases the risk that this bug appears.
The fix consists in making FD_POLL_HUP and FD_POLL_ERR sticky, and
checking them only if FD_POLL_IN is not set, meaning that we have
read all pending data.
That way, the problem is definitely fixed and sepoll still remains about
17% faster than epoll since it can take into account all information
returned by the kernel.
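A minimal sketch of the sticky-flag idea (structure, values and names simplified, not the actual haproxy code):
#define FD_POLL_IN   0x01
#define FD_POLL_ERR  0x04
#define FD_POLL_HUP  0x08
struct fd_entry { int ev; };
static struct fd_entry fdtab[1024];
static void process_fd_events(int fd, int reported)
{
    /* IN reflects this wakeup only; HUP and ERR are sticky */
    fdtab[fd].ev &= FD_POLL_HUP | FD_POLL_ERR;
    fdtab[fd].ev |= reported;
    if (fdtab[fd].ev & FD_POLL_IN) {
        /* drain pending data first; sticky HUP/ERR remain for later */
    } else if (fdtab[fd].ev & (FD_POLL_HUP | FD_POLL_ERR)) {
        /* nothing left to read: now it is safe to close */
    }
}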
The source address selection for health checks did not consider
the new transparent proxy method. Rely on the same unified function
as the other connect() calls.
This patch also fixes a bug by which the proxy's source address was
ignored if cttproxy was used.
Balabit's TPROXY version 4 which replaces CTTPROXY provides a similar
API to the previous proxy, but relies on IP_FREEBIND instead of
IP_TRANSPARENT. Let's add it.
Using some Linux kernel patches which add the IP_TRANSPARENT
SOL_IP option, it is possible to bind to a non-local address
without having to resort to any sort of NAT, thus causing no
performance degradation.
This is by far faster and cleaner than the previous CTTPROXY
method. The code has been slightly changed in order to remain
compatible with CTTPROXY as a fallback for the new method when
it does not work.
It is not needed anymore to specify the outgoing source address
for connect, it can remain 0.0.0.0.
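A minimal sketch of the new method (error handling omitted; IP_TRANSPARENT may be absent from older libc headers, hence the fallback define):
#include <sys/socket.h>
#include <netinet/in.h>
#ifndef IP_TRANSPARENT
#define IP_TRANSPARENT 19  /* Linux value */
#endif
int make_transparent(int fd)
{
    int one = 1;
    /* after this succeeds, the socket may bind() to a foreign address */
    return setsockopt(fd, SOL_IP, IP_TRANSPARENT, &one, sizeof(one));
}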
Using some Linux kernel patches, it is possible to redirect non-local
traffic to local sockets when IP forwarding is enabled. In order to
enable this option, we introduce the "transparent" option keyword on
the "bind" command line. It will make the socket reachable by remote
sources even if the destination address does not belong to the machine.
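For example (address illustrative):
listen intercept
    bind 192.168.1.100:8080 transparent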
A copy-paste typo was present in the reconnection code responsible
for redispatching. The client's FSM would not be re-evaluated if an
error occurred. It looks harmless, but better fix it.
Several users have complained that when haproxy gets a connection
failure due to an active reject from a server, it immediately
retries, often leading to the same situation being repeated until
the retry counter reaches zero.
Now if a connection error shows up, a turn-around state of 1 second
is applied before retrying. This is performed by faking a connection
timeout in order not to touch much code. However, a cleaner method
would involve an extra state.
This patch extends a little the previously added functionality to also
count retries and redispatches for servers. Now it is possible to know
which server causes redispatches, as it is not always the same one that
takes most retries.
While working with the code I found that redistribute_pending() does not
increment srv->redispatches and be->redispatches. I don't know how to test
it but I think the fix is correct. If not, I can withdraw it.
I also extended logs to show how many retries were done and if redispatching
was necessary ('+'). I'm using an additional session flag SN_REDISP to match
redispatched connections. I had to rearrange all defines in session.h to make
more room for it.
The documentation about logs was also fixed a little (sorry, English only),
as the current version uses a totally different format. BTW: examples are
still outdated, maybe next time...
Finally, I changed %d -> %u for retries/redispatches as those variables
are declared as unsigned.
It was abnormal to see more connect errors than connect attempts.
This was caused by the fact that the server's connection count was
not incremented for failed connect() attempts.
Now the per-server connections are correctly incremented for each
connect() attempt. This includes the retries too. The number of
connections effectively served by a server will then be :
srv->cum_sess - srv->errors - srv->warnings
In order to offer DoS protection, it may be required to lower the maximum
accepted time to receive a complete HTTP request without affecting the client
timeout. This helps protecting against established connections on which
nothing is sent. The client timeout cannot offer a good protection against
this abuse because it is an inactivity timeout, which means that if the
attacker sends one character every now and then, the timeout will not
trigger. With the HTTP request timeout, no matter what speed the client
types, the request will be aborted if it does not complete in time.
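A hedged example, assuming the protection is exposed as the "timeout http-request" directive known from later haproxy documentation:
defaults
    mode http
    timeout http-request 10s
    timeout client 30s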
This new parameter makes it possible to override the default
number of consecutive incoming connections which can be
accepted on a socket. By default it is not limited in single
process mode, and limited to 8 in multi-process mode.
Add the "backlog" parameter to frontends, to give hints to
the system about the approximate listen backlog desired size.
In order to protect against SYN flood attacks, one solution is
to increase the system's SYN backlog size. Depending on the
system, sometimes it is just tunable via a system parameter,
sometimes it is not adjustable at all, and sometimes the system
relies on hints given by the application at the time of the
listen() syscall. By default, HAProxy passes the frontend's
maxconn value to the listen() syscall. On systems which can
make use of this value, it can sometimes be useful to be able
to specify a different value, hence this backlog parameter.
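For example (values illustrative):
frontend www
    bind :80
    maxconn 20000
    backlog 30000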
It is sometimes required to know some information such as the
process uptime when consulting statistics. This patch adds the
"show info" command to query this information on the UNIX
socket.
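For example, assuming a socket declared with something like "stats socket /var/run/haproxy.sock" in the global section (path illustrative):
echo "show info" | socat stdio unix-connect:/var/run/haproxy.sock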
The build process was getting annoying under some conditions,
especially on platforms which are used to setting CFLAGS, as well
as those which set a lot of complex defines. The new Makefile
takes care of this situation by not mixing TARGET, CPU and user
values, and by privileging the pre-setting of common
variables with the ability to override them.
Now CFLAGS and LDFLAGS are set by default and may be overridden
without the risk of breaking useful defines. Options are better
dealt with, and as a bonus, it was possible to merge the FreeBSD
and OpenBSD targets into the common GNU Makefile.
The report of build options by "haproxy -vv" has been slightly
adapted to the new mode. Options implied by architecture are not
reported, only user-specified options are. It is also possible to
add options which will not be reported, in order not to mangle the
output when specifying dirty information such as URLs...
The Makefile was copiously documented and it should be easier to
build for any target now. Backwards compatibility with older
build processes was kept, and warnings are emitted for deprecated
build options.
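Typical invocations then look like this (target and option names as documented in the Makefile itself):
make TARGET=linux26 CPU=i686 USE_PCRE=1
make TARGET=generic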
This patch adds the possibility to invert most of the available options
by introducing the "no" keyword, available as an additional prefix.
If it is found, arguments are shifted left and an additional flag (inv)
is set.
It makes it possible to use all options from the current defaults section,
except the selected ones, for example:
-- cut here --
defaults
contimeout 4200
clitimeout 50000
srvtimeout 40000
option contstats
listen stats 1.2.3.4:80
no option contstats
-- cut here --
Currently, inversion works only with the "option" keyword.
The patch also moves last_checks calculation at the end of the readcfgfile()
function and changes "PR_O_FORCE_CLO | PR_O_HTTP_CLOSE" into "PR_O_FORCE_CLO"
in cfg_opts so it is possible to invert forceclose without breaking httpclose
(and vice versa) and to invert tcpsplice in one proxy but to keep a proper
last_checks value when tcpsplice is used in another proxy. Now, the code
checks for PR_O_FORCE_CLO everywhere it checks for PR_O_HTTP_CLOSE.
I also decided to deprecate the "redisp" and "redispatch" keywords, as it is
IMHO better to use "option redispatch", which can be inverted.
Some useful documentation was added, and at the same time I sorted
(alphabetically) all valid options, both in the code and the documentation.
The error check on the return of start_proxies() checked for exactly
ERR_RETRYABLE but did not consider the return as a bit field. The function
returned both ERR_RETRYABLE and ERR_ALERT, hence the problem.
Solaris, as well as many other unixes doesn't know about sun_len
for UNIX domain sockets. It does not honor the __SOCKADDR_COMMON
macro either. After looking at MacOS-X man (which is the same as
BSD man), OpenBSD man, and examples on the net, it appears that
those which support sun_len do not actually use it, or at least
ignore it as long as it's zero. Since all the sockaddr structures
are zeroed prior to being filled, it causes no problem not to set
sun_len, and this fixes build on other platforms.
Another problem on Solaris was that the "sun" name is already
defined as a macro returning a number, so it was necessary to
rename it.
The code in haproxy-1.3.13.1 only supports syslogging to an internet
address. The attached patch:
- Adds support for syslogging to a UNIX domain socket (e.g., /dev/log).
If the address field begins with '/' (absolute file path), then
AF_UNIX is used to construct the socket. Otherwise, AF_INET is used.
- Achieves clean single-source build on both Mac OS X and Linux
(sockaddr_in.sin_len and sockaddr_un.sun_len field aren't always present).
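For example (addresses illustrative):
log /dev/log  local0         # '/' prefix: AF_UNIX socket
log 10.0.0.10 local0 notice  # otherwise: AF_INET/UDP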
For handling sendto() failures in send_log(), it appears that the existing
code is fine (no need to close/recreate socket) for both UDP and UNIX-domain
syslog server. So I left things alone (did not close/recreate socket).
Closing/recreating socket after each failure would also work, but would lead
to increased amount of unnecessary socket creation/destruction if syslog is
temporarily unavailable for some reason (especially for verbose loggers).
Please consider this patch for inclusion into the upstream haproxy codebase.
One user reported that an indicator was missing in the statistics:
the number of times each server was selected by load balancing. It
is in fact the total number of sessions assigned to a server by the
load balancing algorithm. It should directly reflect the weight for
"fair" algorithms such as round-robin, since it will not account for
persistent connections.
It should help a lot in tuning each server's weight depending on the
load it receives.
Because of a divide, it was possible to have a null weight during
a slowstart, which is pretty annoying, especially with a single
server and a long slowstart.
Also, fix the way we report the values in the stats page to avoid
confusion.
A new "timeout" keyword replaces old "{con|cli|srv}timeout", and
provides the ability to independently set the following timeouts:
- client
- tarpit
- queue
- connect
- server
- appsession
Additionally, the "clitimeout", "contimeout" and "srvtimeout" values
are supported but deprecated. No warning is emitted yet when they are
used since the option is very new.
Other timeouts should follow soon now.
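For example (values in milliseconds, illustrative):
defaults
    timeout client  50000
    timeout connect  5000
    timeout server  50000
    timeout queue   30000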
Now the connect timeout, tarpit timeout and queue timeout are
distinct. In order to retain compatibility with older versions,
if either queue or tarpit is left unset both in the proxy and
in the default proxy, then it is inherited from the connect
timeout as before.
It is not always handy to manipulate large values expressed
in milliseconds for timeouts. Also, some values are entered
in seconds (such as the stats refresh interval). This patch
adds support for time units. It knows about 'us', 'ms', 's',
'm', 'h', and 'd'. It automatically converts each value into
the caller's expected unit. Unit-less values are still passed
unchanged.
The unit must be passed as a suffix to the number. For instance:
clitimeout 15m
If any character is not understood, an error is returned.
This new function accepts inputs in various default units, from
the microsecond to the day. It detects suffixes after numbers
and performs the appropriate conversions between the user's unit
and the program's unit, considering a unit-less number in the
default unit.
In order to avoid issues in the future, we want to restrict
the set of allowed characters for identifiers. Starting from
now, only A-Z, a-z, 0-9, '-', '_', '.' and ':' will be allowed
for a proxy, a server or an ACL name.
A test file has been added to check the restriction.
Sometimes it is useful to find out how a given binary version was
built. The build compiler and options are now provided for this,
and it's possible to get them with the -vv option.
Now we can compute the max place depending on the number of servers,
maximum weight and weight scale. The formula has been stored as a
comment so that it's easy to choose between smooth weight ramp up
and high number of servers. The default scale has been set to 16,
which permits 4000 servers with a granularity of 6% in the worst
case (weight=1).
Due to the fact that htons is defined as a macro, it's dangerous
to call it with auto-incremented arguments such as htons(f(++x)) :
src/standard.c: In function 'url2sa':
src/standard.c:291: warning: operation on 'curr' may be undefined
The solution is simply to store the intermediate result and pass it
to htons() at once.
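A minimal illustration of the pattern (parse_port() is a hypothetical parser advancing the cursor it is given):
#include <arpa/inet.h>
#include <stdlib.h>
static unsigned short parse_port(const char **curr)
{
    char *end;
    unsigned short p = (unsigned short)strtoul(*curr, &end, 10);
    *curr = end;  /* side effect: the cursor advances */
    return p;
}
unsigned short read_port(const char **curr)
{
    /* Unsafe if htons() is a macro expanding its argument twice:
     *     return htons(parse_port(curr));   -- cursor may advance twice
     * Safe: store the intermediate result, then convert it once. */
    unsigned short p = parse_port(curr);
    return htons(p);
}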
Under certain circumstances, it is very useful to be able to fail some
monitor requests. One specific case is when the number of servers in
the backend falls below a certain level. The new "monitor fail" construct
followed by either "if"/"unless" <condition> makes it possible to specify
ACL-based conditions which will make the monitor return 503 instead of
200. Any number of conditions can be passed. Another use may be to limit
the requests to local networks only.
The new "nbsrv" ACL verb matches the number of active servers in a backend.
By default, it applies to the backend where it is declared, but optionally
it can receive the name of another backend as an argument in parentheses.
It counts the number of enabled active servers first, then the number of
enabled backup servers.
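A hedged example combining "monitor fail" with the new "nbsrv" ACL (backend name illustrative):
frontend www
    bind :80
    monitor-uri /site_alive
    acl site_dead nbsrv(dynamic) lt 2
    monitor fail if site_dead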
AIX does not know about MSG_DONTWAIT. Fortunately, nearly all sockets
are already set to O_NONBLOCK, so it's not even required to change the
code. It was only necessary to add this fcntl to the log socket which
lacked it. The MSG_DONTWAIT value has been defined to zero when unset
in order to make the code cleaner and more portable.
Also, on AIX, "hz" is defined, which causes a problem with one function
parameter in time.c. It's enough to rename the parameter there. Last,
fix a missing #include <string.h> in proxy.c.
A new "throttle" column has been added to HTML and RAW stats to indicate
in percent, the level of throttling due to server warmup. The column is
empty at 100%.
The new 'slowstart' parameter for a server accepts a value in
milliseconds which indicates after how long a server which has
just come back up will run at full speed. The speed grows
linearly from 0 to 100% during this time. The limitation applies
to two parameters :
- maxconn: the number of connections accepted by the server
will grow from 1 to 100% of the usual dynamic limit defined
by (minconn,maxconn,fullconn).
- weight: when the backend uses a dynamic weighted algorithm,
the weight grows linearly from 1 to 100%. In this case, the
weight is updated at every health-check. For this reason, it
is important that the 'inter' parameter is smaller than the
'slowstart', in order to maximize the number of steps.
The slowstart never applies when haproxy starts, otherwise it
would cause trouble to running servers. It only applies when
a server has been previously seen as failed.
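For example (values illustrative; 'inter' kept well below 'slowstart' as recommended above):
backend app
    balance roundrobin
    server s1 10.0.0.1:80 check inter 2000 slowstart 60000 weight 32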
It's important to be able to distinguish between servers which are UP
and those which are UP but disabled via a 404 response. For this reason,
the status entries report "NOLB" instead of "UP", and the HTML page uses
darker colors. As a complement, write "DOWN" in bold red on the backend
if it has no server left for load balancing.
It's not always obvious for the callers of set_server_status_{up,down}
whether the new state really is up or down. Some flags as well as the
effective weight have to be considered. Let's ensure that those functions
perform the necessary check themselves so that if the state transition
cannot be performed, at least everything is updated as required.
When an HTTP server returns "404 not found", it indicates that at least
part of it is still running. For this reason, it can be convenient for
application administrators to be able to consider code 404 as valid,
but for a server which does not want to participate to load balancing
anymore. This is useful to seamlessly exclude a server from a farm
without acting on the load balancer. For instance, let's consider that
haproxy checks for the "/alive" file. To enable load balancing on a
server, the admin would simply do :
# touch /var/www/alive
And to disable the server, he would simply do :
# rm /var/www/alive
Another immediate gain from doing this is that it is now possible to
send NOTICE messages instead of ALERT messages when a server is first
disabled, then goes down. This provides a graceful shutdown method.
To enable this behaviour, specify "http-check disable-on-404" in the
backend.
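For example (path and address illustrative):
backend app
    option httpchk HEAD /alive HTTP/1.0
    http-check disable-on-404
    server s1 10.0.0.1:80 check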
Hello,
You will find attached an updated release of the previously submitted patch.
It polishes some parts and extends the ACL engine to match the IP and port
parsed in the HTTP request. (It also takes care of comments made by Willy! ;))
Best regards,
Alexandre
The number of possible options for a proxy has already reached
32, which is the current limit due to the fact that they are
each represented as a bit in a 32-bit word.
It's possible to move the load balancing algorithms to another
place. It will also save some space for future algorithms.
This round robin algorithm was written from trees, so that we
do not have to recompute any table when changing server weights.
This solution allows on-the-fly weight adjustments with immediate
effect on the load distribution.
There is still a limitation due to 32-bit computations, to about
2000 servers at full scale (weight 255), or more servers with
lower weights. Basically, sum(srv.weight)*4096 must be below 2^31
(2000 * 255 * 4096 = 2088960000, just under 2^31 = 2147483648).
Test configurations and an example program used to develop the
tree will be added next.
Many changes have been brought to the weights computations and
variables in order to accommodate the possibility of a server
being running but disabled from load balancing due to a null weight.
Under some circumstances, it will be useful to be able to have
a server's effective weight bigger than the user weight, and this
is particularly true for dynamic weight-based algorithms. In order
to support this, we add a "wdiv" member to the lbprm structure
which will always be used to divide the weights before reporting
them.
Since the introduction of server weights, all load balancing algorithms
relied on a pre-computed map. Incidentally, quite a bunch of map-specific
parameters were used at random places in order to get the number of
servers or their total weight. It was not architecturally acceptable
that optimizations for the map computation had impact on external parts.
For instance, during this cleanup it was found that a backend weight was
seen as 1 when only the first backup server is used, whatever its weight.
This cleanup consists in differentiating between LB-generic parameters,
such as total weights, number of servers, etc... and map-specific ones.
The struct proxy has been enhanced in order to make it easier to later
support other algorithms. The recount_servers() function now also
updates generic values such as total weights so that it's not needed
anymore to call recalc_server_map() when weights are needed. This
made it possible to simplify some code which does not need to know
about map internals anymore.
It was possible to slightly reduce the size and the number of
operations in session_process_counters(). Two 64 bit comparisons
were removed, reducing the code by 98 bytes on x86 due to the lack
of registers. The net observed performance gain is almost 2%, which
cannot be attributed to those optimizations, but more likely to
induced changes in code alignment in other functions.
By default, counters used for statistics calculation are incremented
only when a session finishes. It works quite well when serving small
objects, but with big ones (for example large images or archives) or
with A/V streaming, a graph generated from haproxy counters looks like
a hedgehog.
This patch implements the contstats (continuous statistics) option.
When set, counters are incremented continuously, during the whole session.
Recounting touches a hotpath directly, so it is not enabled by default,
as it has a small performance impact (~0.5%).
It is very convenient for SNMP monitoring to have unique process ID,
proxy ID and server ID. Those have been added to the CSV outputs.
The numbers start at 1. 0 is reserved. For servers, 0 means that the
reported name is not a server name but half a proxy (FRONTEND/BACKEND).
A remaining hidden "-" in the CSV output has been eliminated too.
Proxy listeners were very special and not very easy to manipulate.
A proto_tcp file has been created with all that is required to
manage TCPv4/TCPv6 as raw protocols, and provide generic listeners.
The code of start_proxies() and maintain_proxies() now looks less
like spaghetti. Also, event_accept will need a serious lifting in
order to use more of the information provided by the listener.
It is important that unbind_listener() calls fd_delete() to remove
a file descriptor, because only this one can update the fdtab and
the maxfd entries.
There was a missing state for listeners, when they are not listening
but still attached to the protocol. The LI_ASSIGNED state was added
for this purpose. This made it possible to clean up the assignment/release
workflow quite a bit. Generic enable/enable_all/disable/disable_all
primitives were added, and a disable_all entry was added to the
struct protocol.
The error path in event_accept() was complicated by many code
duplications. Use the classical unrolling with gotos. This
fix alone reduced the code by 2.5 kB.
Small optimization: in some cases, it's not interesting to call
functions which are dedicated to checking the cache headers or
cookies. Avoid calling them when not necessary.
It's not easy to report useful information to help the user quickly
fix a configuration. This patch:
- removes the word "listener" in favor of "proxy", as the latter has
been used since the beginning;
- ensures that the same function (hence the same words) will be
used to report capabilities of a proxy being declared and an
existing proxy;
- avoids the term "conflicting capabilities" in favor of "overlapping
capabilities", which is more exact;
- just reports that the same name is reused in case of warnings.
This patch:
- adds proxy_mode_str(), similar to proxy_type_str()
- adds a generic findproxy() function used with default_backend/setbe/use_backend
- rewrites default_backend/setbe/use_backend to use the introduced findproxy()
- relaxes the duplicated proxy check
- changes capability display from "%X" to "%s" with a call to proxy_type_str()
The default_backend did not work in TCP mode since there was no
header state to assign the backend. This causes much trouble when
configs are created by copy-paste.
The solution was to fix the way the backend is assigned upon accept().
A wrong contimeout assignment was fixed too.
Some applications do not have a strict persistence requirement, yet
it is still desirable for performance considerations, due to local
caches on the servers. For some reasons, there are some applications
which cannot rely on cookies, and for which the last resort is to use
a parameter passed in the URL.
The new 'url_param' balance method is there to solve this issue. It
accepts a parameter name which is looked up from the URL and which
is then hashed to select a server. If the parameter is not found,
then the round robin algorithm is used in order to provide a normal
load balancing across the servers for the first requests. It would
have been possible to use a source IP hash instead, but since such
applications are generally buried behind multiple levels of
reverse-proxies, it would not provide a good balance.
The doc has been updated, and two regression testing configurations
have been added.
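For example (parameter name and addresses illustrative):
backend dynamic
    balance url_param userid
    server app1 10.0.0.1:80 check
    server app2 10.0.0.2:80 check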
A new function "backend_parse_balance" has been created in backend.c,
which is dedicated to the parsing of the "balance" keyword. It will
provide easier methods for adding new algorithms.
In preparation for newer balance algorithms, group the
sparse PR_O_BALANCE_* values into layer4 and layer7-based
algorithms. This will ease addition of newer algorithms.
Currently, there is a hidden line length limit in haproxy, set
to 256-1 chars. With large ACLs (for example many hdr(host) matches)
it may not be enough and, what is even worse, the error message may
be totally confusing, as everything above this limit is treated
as the next line:
echo -ne "frontend aqq 1.2.3.4:80\nmode http\nacl e hdr(host) -i X X X X X X X www.xx.example.com stats\n"|
sed s/X/www.some-host-name.example.com/g > ha.cfg && haproxy -c -f ./ha.cfg
[WARNING] 300/163906 (11342) : parsing [./ha.cfg:4] : 'stats' ignored because frontend 'aqq' has no backend capability.
Recently I hit a similar problem and it took me a while to find why
requests for "stats" were not handled properly.
This patch:
- makes the limit configurable (LINESIZE)
- increases default line length limit from 256 to 2048
- increases MAX_LINE_ARGS from 40 to 64
- fixes hidden assignment in fgets()
- moves arg/end/args/line inside the loop, making code auditing easier
- adds a check that shows error if the limit is reached
- changes "*line++ = 0;" to "*line++ = '\0';" (cosmetics)
With this patch, when LINESIZE is defined to 256, above example produces:
[ALERT] 300/164724 (27364) : parsing [/tmp/ha.cfg:3]: line too long, limit: 255.
[ALERT] 300/164724 (27364) : Error reading configuration file : /tmp/ha.cfg
Hello,
As it is possible to use the same name for two proxies, make sure that
use_backend & friends do not match the wrong proxy when used with
use_backend/default_backend/setbe. For example, without this patch, when
there is a backend and a frontend with the same name (first the backend and
then the frontend trying to use a specific backend), the application will
likely try to use the frontend instead of the backend, complaining loudly
about a loop.
Best regards,
Krzysztof Oledzki
This patch adds the "maxqueue" parameter to the server. This allows new
sessions to be immediately rebalanced when the server's queue is filled.
It's useful when session stickiness is just a performance boost (even a
huge one) but not a requirement.
This should only be used if session affinity isn't a hard functional
requirement, but provides a performance boost by keeping server-local
caches hot and compact.
Absence of the 'maxqueue' option means an unlimited queue. When the queue
gets filled up to 'maxqueue', the client session is moved from the
server-local queue to a global one.
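For example (values illustrative):
server app1 10.0.0.1:80 cookie a1 check maxconn 200 maxqueue 50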
This is in fact the same as ultoa() except that it's possible to
pass the string to be returned in case the value is NULL. This is
useful to report limits in printf calls.
The current ultoa() function is limited to one use per expression or
function call. Sometimes this is limiting. Change this in favor
of an array of 10 return values and shorter macros U2A0..U2A9
which respectively call the function with the 10 different buffers.
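A sketch of the mechanism (buffer sizes and names illustrative, not the exact code):
static char itoa_str[10][21];
const char *ultoa_r(unsigned long n, char *buffer, int size);  /* converter */
#define U2A0(n) ultoa_r((n), itoa_str[0], sizeof(itoa_str[0]))
#define U2A1(n) ultoa_r((n), itoa_str[1], sizeof(itoa_str[1]))
/* ... up to U2A9(n), using itoa_str[9] */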
localtime() was called with pointers to tv_sec, which is time_t on
some platforms and long on others. A problem was encountered on
Sparc64 under OpenBSD where tv_sec is long (64 bits) and time_t is
32 bits. Since this architecture is big-endian, it exhibited the
bug because localtime() always worked with the high part of the
value which is always zero. This problem was identified and debugged
by Thierry Fournier.
The correct solution is to pass the date by value and not by pointer,
through an intermediate function. The use of localtime_r() instead of
localtime() also made it possible to get rid of the first call to
localtime() since it does not need to allocate memory anymore.
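A minimal sketch of the fix (names illustrative):
#include <sys/time.h>
#include <time.h>
/* Wrong where tv_sec and time_t differ in size (e.g. 64-bit long vs
 * 32-bit time_t on big-endian Sparc64/OpenBSD):
 *     struct tm *tm = localtime((time_t *)&tv->tv_sec);
 * Right: pass the value through a genuine time_t, and use the
 * reentrant variant which allocates nothing. */
void tv_to_tm(const struct timeval *tv, struct tm *out)
{
    time_t sec = tv->tv_sec;
    localtime_r(&sec, out);
}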
The strl2ic() and strl2uic() primitives used to convert string to
integers could return 10 times the value read if they stopped on
non-digit because of a mis-placed loop exit.
Hello,
This patch implements new statistics for SLA calculation by adding a new
field 'Dwntime' with the total down time since restart (both HTTP/CSV) and
extending the status field (HTTP) or inserting a new one (CSV) with the time
showing how long each server/backend has been in its current state.
Additionally, down transitions are also calculated and displayed for
backends, so it is possible to know how many times a selected backend was
down, generating the "No server is available to handle this request." error.
New information is presented in two different ways:
- for HTTP: a "human readable form", one of "100000d 23h", "23h 59m" or
"59m 59s"
- for CSV: seconds
I believe that a resolution of one second is enough.
As there are more columns in the status page I decided to shrink some
names to make more space:
- Weight -> Wght
- Check -> Chk
- Down -> Dwn
Making the described changes I also made some improvements and fixed some
small bugs:
- don't increment s->health above 's->rise + s->fall - 1'. Previously it
was incremented and then (re)set to 's->rise + s->fall - 1'.
- do not set server down if it is down already
- do not set server up if it is up already
- fix colspan in multiple places (mostly introduced by my previous patch)
- add missing "status" header to CSV
- fix order of retries/redispatches in server (CSV)
- s/Tthen/Then/
- s/server/backend/ in DATA_ST_PX_BE (dumpstats.c)
Changes from the previous version:
- deal with negative time intervals
- don't rely on s->state (SRV_RUNNING)
- slightly reworked human_time + compacted format (no spaces). If needed it
can be used in the future for other purposes by optionally making "cnt"
an argument
- leave set_server_down mostly unchanged
- only slightly reworked "process_chk: 9"
- additional fields in CSV are appended to the right
- fix "SEC" macro
- named arguments (human_time, be_downtime, srv_downtime)
Hope it is OK. If only cosmetic changes are needed, please feel free
to correct them; however if some bigger changes are required I would
like to discuss them first, or at least to know what exactly was changed,
especially since I already put this patch into my production server. :)
Thank you,
Best regards,
Krzysztof Oledzki
Currently haproxy accepts a config with duplicated proxies
(listen/frontend/backend/ruleset). This patch fixes this, so the application
will complain when there is an error.
With this modification it is still possible to use the same name for two
proxies (for example frontend&backend) as long there is no conflict:
           listen  backend  frontend  ruleset
listen        -       -         -        -
backend       -       -         OK       -
frontend      -       OK        -        -
ruleset       -       -         -        -
Best regards,
Krzysztof Oledzki
It is important to know how your installation performs. Haproxy masks
connection errors, which is extremely good for a client but bad for
an administrator (except people believing that "ignorance is bliss").
Attached patch adds retries and redispatches counters, so now haproxy:
1. For server:
- counts retried connections (masked or not)
2. For backends:
- counts retried connections (masked or not) that happened to
a slave server
- counts redispatched connections
- does not count successfully redispatched connections as backend errors.
Errors are increased only when the client does not get a valid response,
in other words: with a failed redispatch or when this function is not
enabled.
3. For statistics:
- display Retr (retries) and Redis (redispatches) as a "Warning"
information.
Removed old unused MODE_LOG and MODE_STATS, and replaced the "stats"
keyword in the global section. The new "stats" keyword in the global
section is used to create a UNIX socket on which the statistics will
be accessed. The client must issue a "show stat\n" command in order
to get a CSV-formatted output similar to the output on the HTTP socket
in CSV mode.
A unix socket can now access the statistics. It currently only
recognizes the "show stat\n" command at the beginning of the
input, then returns the statistics in CSV format.
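For example, assuming a socket declared with something like "stats socket /var/run/haproxy.sock" in the global section (path illustrative):
echo "show stat" | socat stdio unix-connect:/var/run/haproxy.sock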
It is now possible to get CSV output from the statistics by
simply appending ";csv" to the HTTP request sent to get the
stats. The fields keep the same ordering as in the HTML page,
and a field "pxname" has been prepended at the beginning of
the line.
A new file, proto_uxst.c, implements support of PF_UNIX sockets
of type SOCK_STREAM. It relies on generic stream_sock_read/write
and uses its own accept primitive which also tries to be generic.
Right now it only implements an echo service in sight of a general
support for stats dumping via the unix socket. The echo code is more
of a proof of concept than useful code.
A new generic protocol mechanism has been added. It provides
an easy method to implement new protocols with different
listeners (eg: unix sockets).
The listeners are automatically started at the right moment
and enabled after the possible fork().
This must come from a copy-paste typo: in the unlikely event that
fork() would fail, the parent process would only exit(1) if there
were old pids. That's nonsense.
In case the incoming socket is set for write and not for read (very
unlikely, except in HEALTH mode), the timeout may remain eternity due
to a copy-paste typo.
The stream_sock_* functions had to know about sessions just in
order to get the server's address for a connect() operation. This
is not desirable, particularly for non-IP protocols (eg: PF_UNIX).
Put a pointer to the peer's sockaddr_storage or sockaddr address
in the fdtab structure so that we never need to look further.
With this small change, the stream_sock.c file is now 100% protocol
independent.
For people who manage many haproxies, it is sometimes convenient
to be informed of their version. This patch adds this, with the
option to disable this report by specifying "stats hide-version".
Also, the feature may be permanently disabled by setting the
STATS_VERSION_STRING to "" (empty string), or the format can
simply be adjusted.
Initial patch only managed to spread the checks when the checks
failed. The randomization code needs to be added also in the path
where the server is going fine.
When one server in one backend has a very low check interval, it imposes
its value as the minimal interval, causing all other servers to start
their checks close to each other, thus partially voiding the benefits of
the spread checks.
The solution consists in ignoring intervals lower than a given value
(SRV_CHK_INTER_THRES = 1000 ms) when computing the minimal interval,
and then assigning them a start date relative to their own interval
and not the global one.
With this change, the checks distribution clearly looks better.
When one server appears at the same position in multiple backends, it
receives all the checks from all the backends exactly at the same time
because the health-checks are only spread within a backend but not
globally.
Attached patch implements per-server start delay in a different way.
Checks are now spread globally - not locally to one backend. It also makes
them start faster - IMHO there is no need to add a 'server->inter' when
calculating the first execution. Calculations were moved from cfgparse.c to
checks.c. There is a new function start_checks(), and it is not called
when haproxy is started in MODE_CHECK.
With this patch it is also possible to set a global 'spread-checks'
parameter. It takes a percentage value (1..50, probably something near
5..10 is a good idea) so haproxy adds or removes that many percent to the
original interval after each check. My test shows that with 18 backends,
54 servers total and 10000ms/5% it takes about 45m to mix them completely.
I decided to use the rand/srand pseudo-random number generator. I am aware
it is not recommended for good randomness, but a) we do not need a good
random generator here, and b) it is probably the most portable one.
The following patch gives the ability to tweak the socket linger mode.
You can use this option with "option nolinger" inside a frontend or backend
configuration declaration.
This will help in environments where lots of FIN_WAIT sockets are
encountered.
I noticed that haproxy, with "cookie (...) nocache" option, always adds
"Cache-control: private" at the end of a header list received from this
server:
Cache-Control: no-cache
(...)
Set-Cookie: SERVERID=s6; path=/
Cache-control: private
or:
Set-Cookie: ASPSESSIONIDCSRCTSSB=HCCBGGACGBHDHMMKIOILPHNG; path=/
Cache-control: private
Set-Cookie: SERVERID=s5; path=/
Cache-control: private
It may be just redundant (two "Cache-control: private" headers), but
sometimes it may be quite confusing, as we may end up with two different,
more and less restrictive directives (no-cache & private) and even quite
conflicting directives (eg. public & private).
So, I added and rearranged some code, so now haproxy adds a "Cache-control:
private" header only when there is no identical (private) or more
restrictive (no-cache) one. It was done in three steps:
1. Use check_response_for_cacheability to check if response is
not cacheable. I simply moved this call before http_header_add_tail2.
2. Use TX_CACHEABLE (not TX_CACHE_COOK - apache <= 1.3.26) to check if we
need to add a Cache-control header. If we add it, clear TX_CACHEABLE and
TX_CACHE_COOK.
3. Check cacheability not only with PR_O_CHK_CACHE but also with
PR_O_COOK_NOC, so:
- unlikely(t->be->options & PR_O_CHK_CACHE))
+ (t->be->options & (PR_O_CHK_CACHE|PR_O_COOK_NOC)))
txn->flags |= TX_CACHEABLE | TX_CACHE_COOK;
I removed this unlikely since I believe that now it is not so unlikely.
The patch is definitely not perfect; the proxy should probably also remove
"Cache-control: public". Unfortunately, I do not know the code well enough
to do it myself, yet. ;)
Anyway, I think that even now, it should be very useful.
On Sat, 22 Sep 2007, Willy Tarreau wrote:
> On Sun, Sep 23, 2007 at 03:23:38AM +0200, Krzysztof Oledzki wrote:
> > I noticed that with httpchk, haproxy generates TCP RST at end of a check.
> > IMHO, it would be more polite to send FIN to a server, especially that
> > each TCP RST found by a tcpdump makes me concerned that something is
> > wrong, as it is hard to distinguish between a RST from a httpchk and from
> > a normal request, forwarded for a client.
>
> I have also noticed it very recently. In fact, it's never the
> application (here haproxy) which decides to send an RST, it's the
> system. It does so because the server returns data on a terminated
> socket. I guess it's because the health-check code does not read much
> of the response. In fact, we just need to read enough to process common
> responses. If people are dumb enough to check with something like "GET
> /image.iso", they should expect to get an RST after a few kbytes
> instead of reading the whole file!
Right, that was easy. The attached patch changes what you described. Now
haproxy finishes http checks with FIN.
This patch fixes a nasty bug reported by both glibc and valgrind, which
leads to a problem where haproxy does not exit when a new instance
starts up (-sf/-st).
==9299== Invalid free() / delete / delete[]
==9299== at 0x401D095: free (in
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==9299== by 0x804A377: deinit (haproxy.c:721)
==9299== by 0x804A883: main (haproxy.c:1014)
==9299== Address 0x41859E0 is 0 bytes inside a block of size 21 free'd
==9299== at 0x401D095: free (in
/usr/lib/valgrind/x86-linux/vgpreload_memcheck.so)
==9299== by 0x804A84B: main (haproxy.c:985)
==9299==
6542 open("/dev/tty", O_RDWR|O_NONBLOCK|O_NOCTTY) = -1 ENOENT (No such file
or directory)
6542 writev(2, [{"*** glibc detected *** ", 23}, {"corrupted double-linked
list", 28}, {": 0x", 4}, {"6ff91878", 8}, {" ***\n", 5}], 5) = -1 EBADF (Bad
file descriptor)
I found this bug trying to find why, after one week with many restarts, I
finished with >100 haproxy processes running. ;)
When switching from a frontend to a backend, the "retries" parameter
was not kept, resulting in the impossibility to reconnect after the
first connection failure. This problem was reported and analyzed by
Krzysztof Oledzki.
While fixing the code, it appeared that some of the backend's timeouts
were not updated in the session when using "use_backend" or "default_backend".
It seems this had no impact but just in case, it's better to set them as
they should have been.
Since the timers have been changed, the timeouts for the default instance
have not been adjusted. This results in unspecified timeouts becoming zero
instead of infinite.
The syslog UDP socket may receive data, which is not cool because those
data accumulate in the system buffers up to the receive socket buffer size.
To prevent this, we set the receive window to zero and try to shutdown(SHUT_RD)
the socket.
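A minimal sketch of the idea (error handling omitted):
#include <sys/socket.h>
static void silence_log_socket(int fd)
{
    int zero = 0;
    /* shrink the receive buffer so nothing can accumulate... */
    setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &zero, sizeof(zero));
    /* ...and drop the read side entirely; may fail on some systems */
    shutdown(fd, SHUT_RD);
}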
A long chain of if/else prevented many sanity checks from being
performed on TCP listeners, resulting in dangerous configs being
accepted. Removed the offending 'else'.
src/chtbl.c, src/hashpjw.c and src/list.c are distributed under
an obscure license. While Aleks and I believe that this license
is OK for haproxy, other people think it is not compatible with
the GPL.
Whether it is or not is not the problem. The fact that it raises
a doubt is sufficient for this problem to be addressed. Arnaud
Cornet rewrote the unclear parts with clean GPLv2 and LGPL code.
The hash algorithm has changed too and the code has been slightly
simplified in the process. A lot of care has been taken in order
to respect the original API as much as possible, including the
LGPL for the exportable parts.
The new code has not been thoroughly tested but it looks OK now.
Under some circumstances, it was possible with speculative I/O to
reallocate multiple entries for the same FD if an fd_{set,clr,set}
or fd_{clr,set,clr} sequence was performed before a schedule.
Fix this by keeping an allocation flag for each fd.
The stats page now supports an option to hide servers which are DOWN
and to enable/disable automatic refresh. It is also possible to ask
for an immediate refresh.
The result of the vsnprintf() called in chunk_printf() must be checked,
and should be added only if lower than the requested size. We simply
return zero if we cannot write the chunk.
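A minimal sketch of the check (the chunk structure is simplified here):
#include <stdarg.h>
#include <stdio.h>
struct chunk { char *str; int size; int len; };
int chunk_printf(struct chunk *chk, const char *fmt, ...)
{
    va_list args;
    int ret;
    va_start(args, fmt);
    ret = vsnprintf(chk->str + chk->len, chk->size - chk->len, fmt, args);
    va_end(args);
    if (ret < 0 || ret >= chk->size - chk->len)
        return 0;  /* error or truncated output: write nothing */
    chk->len += ret;
    return ret;
}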
Sometimes it may be desirable to automatically refresh the
stats page. Most browsers support the "Refresh:" header with
an interval in seconds. Specifying "stats refresh xxx" will
automatically add this header.
The GCD used when computing the servers' weights causes the total
weight of the backend to appear lower than expected because it is
divided by the GCD. Easy solution consists in recomputing the GCD
from the first server and apply it to the global weight.
When a very large number of servers is configured (thousands),
shutting down many of them at once could lead to large number
of calls to recalc_server_map() which already takes some time.
This would result in an O(N^3) computation time, leading to
noticeable pauses on slow embedded CPUs on test platforms.
Instead, mark the map as dirty and recalc it only when needed.
Now we try to free as many pools as possible when a proxy is stopping.
The reason is that we want to ease the process replacement when applying
a new configuration, without keeping too many unused memory allocated.
Those ACLs are sometimes useful for troubleshooting. Two ACL subjects
"always_true" and "always_false" have been added too. They return what
their subject says for every pattern. Also, acl_match_pst() has been
removed.
The new "use_backend" keyword permits full content switching by the
use of ACLs. Its usage is simple :
use_backend <backend_name> {if|unless} <acl_cond>
Implemented the "-i" option on ACLs to state that the matching
will have to be performed for all patterns ignoring case. The
usage is :
acl <aclname> <aclsubject> -i pattern1 ...
If a pattern must begin with "-", either it must not be the first one,
or the "--" option should be specified first.
The deinit() function is specialized in memory area freeing.
There were a ton of information that were not released at the
exit time, which made valgrind complain. Now, most of the entries
are freed. However, it seems like regfree() does not completely
free a regex (12 bytes lost per regex).
Since pools v2, the way pools were destroyed at exit was incorrect,
because it ought to account for users of those pools before freeing
them. This test also ensures there is no double free.
It is now possible to read error messages from local files,
using the 'errorfile' keyword. Those files are read during
parsing, so there's no I/O involved. They make it possible
to return custom error messages with custom status and headers.
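For example (status codes real, paths illustrative):
errorfile 503 /etc/haproxy/errors/503.http
errorfile 400 /etc/haproxy/errors/400.http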
'path', 'path_reg', 'path_beg', 'path_end', 'path_sub', 'path_dir'
and 'path_dom' have been implemented to process the path component
of the URI. It starts after the host part, and stops before the
question mark.
hdr(x), hdr_reg(x), hdr_beg(x), hdr_end(x), hdr_sub(x), hdr_dir(x),
hdr_dom(x), hdr_cnt(x) and hdr_val(x) have been implemented. They
apply to any of the possibly multiple values of header <x>.
Right now, hdr_val() is limited to integer matching, but it should
reasonably be upgraded to match long long ints.
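A few hedged examples of these matches (ACL names and values illustrative):
acl is_static path_beg /static /img
acl is_iso    path_end .iso
acl my_host   hdr_dom(host) example.com
acl has_lang  hdr_cnt(accept-language) ge 1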
Some fetches such as 'line' or 'hdr' need to know the direction of
the test (request or response). A new 'dir' parameter is now
propagated from the caller to achieve this.
ACLs now support operators such as 'eq', 'le', 'lt', 'ge' and 'gt'
in order to give more flexibility to the language. Because of this
change, the 'dst_limit' keyword changed to 'dst_conn' and now requires
either a range or a test such as 'dst_conn lt 1000' which is more
understandable.
By default, epoll/kqueue used to return as many events as possible.
This could sometimes cause huge latencies (latencies of up to 400 ms
have been observed with many thousands of fds at once). Limiting the
number of events returned also reduces the latency by avoiding too
much blind processing. The value is set to 200 by default and can be
changed in the global section using the tune.maxpollevents parameter.
Recreate the epoll file descriptor after a fork(). It will ensure
that all processes will not share their epoll_fd. Some side effects
were encountered because of this, such as epoll_wait() returning an
FD which was previously deleted, in multi-process mode.
Compare the results of recv/send with the parameter passed and
detect whether the system has no free buffer space for send()
or has no data anymore for recv(). This dramatically reduces
the number of syscalls (by about 23%).
A second occurrence of read-timeout rearming was present in stream_sock.c.
To fix the problem, it was necessary to put the shutdown information in
the buffer (already planned).
There is a long-time bug causing busy loops when either client-side
or server-side enters a SHUTR state. When writing data to the FD,
it was possible to re-arm the read side if the write had been paused.
ETERNITY is not 0 anymore, so all timeouts will not be initialized
to ETERNITY by a simple calloc(). We have to explicitly assign them.
This bug caused random session aborts.
wake_expired_tasks() used a hint to avoid scanning the tree in most cases,
but it looks like the hint is more expensive than reaching the first node
in the tree. Disable it for now.
Introduction of timeval timers broke *poll-based pollers, because the call to
tv_ms_remain may return 0 while the event is not elapsed yet. Now we carefully
check for those cases and round the result up by 1 ms.
When we're interrupted by another instance, it is very likely
that the other one will need some memory. Now we know how to
free what is not used, so let's do it.
Also, only free non-null pointers. Previously, pool_destroy()
implicitly checked for this case, which was incidentally
needed.
- keep the number of users of each pool
- call the garbage collector on out of memory conditions
- sort the pools by size for faster creation
- force the alignment size to 16 bytes instead of 4*sizeof(void *)
Also during this process, a bug was found in appsession_refresh().
It would not automatically requeue the task in the queue, so the
old sessions would not vanish.
It was possible in ev_sepoll() to ignore certain events if
all speculative events had been processed at once, because
the epoll_wait() timeout was not cleared, thus delaying the
events delivery.
The state machine was complicated, it has been rewritten.
It seems faster and more correct right now.
The timeout functions were difficult to manipulate because they were
rounding results to the millisecond. Thus, it was difficult to compare
and to check what expired and what did not. Also, the comparison
functions were heavy with multiplies and divides by 1000. Now, all
timeouts are stored in timevals, reducing the number of operations
for updates and leading to cleaner and more efficient code.
Two states were missing in the speculative epoll state transition
matrix. This could cause some timeouts and unhandled events. The
problem showed up in TCP mode with a fast server at high session
rates, but could in theory also affect HTTP mode.
The time subsystem really needs fixing. It was still possible
that some tasks with expiration date below the millisecond in
the future caused busy loop around poll() waiting for the
timeout to happen.
Peter van Dijk contributed this patch which implements the "smtpchk"
option, which is to SMTP what "httpchk" is to HTTP. By default, it sends
"HELO localhost" to the servers, and waits for the 250 message, but it
can also send a specific request.
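For example (domain illustrative):
backend mail
    option smtpchk HELO mydomain.com
    server smtp1 10.0.0.1:25 check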
The file client.c now provides acl_fetch_dip and acl_fetch_dport
to be able to check the client's destination address and port. The
corresponding ACL keywords 'dst' and 'dport' have been added.
The new 'block' keyword makes it possible to block a request based on
ACL test results. Block accepts two optional arguments : 'if' <cond>
and 'unless' <cond>.
The request will be blocked with a 403 response if the condition is validated
(if) or if it is not (unless). Do not rely on this one too much, as it's more
of a proof of concept helping in developing other matches.
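For example (networks illustrative):
acl invalid_src src 0.0.0.0/7 224.0.0.0/3
block if invalid_src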
This framework offers all other subsystems the ability to register
ACL matching criteria. Some generic matching functions are already
provided. Others will come soon and the framework shall evolve.
There are multiple places where the client's destination address is
required. Let's store it in the session when needed, and add a flag
to inform that it has been retrieved.
Problem reported by Andy Smith. If a client sends TCP data
and quickly closes the connection before the server connection
is established, AND the whole buffer can be sent at once when
the connection establishes, then the server side believes that
it can simply abort the connection because the buffer is empty,
without checking that some work was performed.
Fix: ensure that nothing was written before closing.
It is important when parsing configuration file to ensure that at
least one word is empty to mark the end of the line. This will be
required with ACLs in order to avoid reading past the end of line.
Since the introduction of speculative I/O, it was not always possible
to correctly detect a connection establishment. Particularly, in TCP
mode, there is no data to send and getsockopt() returns no error. The
solution consists in trying a connect() again to get its diagnostic.
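A minimal sketch of the technique on a non-blocking socket (error handling simplified):
#include <errno.h>
#include <sys/socket.h>
int check_connect(int fd, const struct sockaddr *addr, socklen_t len)
{
    if (connect(fd, addr, len) == 0 || errno == EISCONN)
        return 1;   /* established */
    if (errno == EALREADY || errno == EINPROGRESS)
        return 0;   /* still pending */
    return -1;      /* real error, e.g. ECONNREFUSED */
}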
tv_cmp2_ms handles multiple combinations of tv1 and tv2, but only
one form is used: (tv1 <= tv2). So it is overkill to use it everywhere.
A new function designed to do exactly this has been written for that
purpose: tv_cmp2_le. Also, removed old unused tv_* functions.
The fact that TV_ETERNITY was 0 was very awkward because it
required that comparison functions handled the special case.
Now it is ~0 and all comparisons are performed on unsigned
values, so that it is naturally greater than any other value.
A performance gain of about 2-5% has been noticed.
The rbtree-based wait queue consumes a lot of CPU. Use the ul2tree
instead. Lots of cleanups and code reorganizations made it possible
to reduce the task struct and simplify the code a bit.
The principle behind speculative I/O is to speculatively try to
perform I/O before registering the events in the system. This
considerably reduces the number of calls to epoll_ctl() and
sometimes even epoll_wait(), and manages to increase overall
performance by about 10%.
The new poller has been called "sepoll". It is used by default
on Linux when it works. A corresponding option "nosepoll" and
the command line argument "-ds" allow disabling it.
Gcc provides __attribute__((constructor)) which is very convenient
to execute functions at startup right before main(). All the pollers
have been converted to have their register() function declared like
this, so that it is not necessary anymore to call them from a centralized
file.
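A minimal sketch of the pattern (the poller structure and registration call are simplified stand-ins):
struct poller { const char *name; };
void register_poller(struct poller *p);  /* hypothetical central registry */
static struct poller sepoll_poller = { "sepoll" };
__attribute__((constructor))
static void _sepoll_register(void)
{
    /* runs automatically before main() */
    register_poller(&sepoll_poller);
}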
The pollers will now be able to speculatively call the I/O
processing functions and decide whether or not they want to
poll on those FDs. The changes primarily consist in teaching
those functions how to pass on the information that they got an EAGAIN.
Now fdtab can contain the FD_POLL_* events so that the pollers
which can fill them can give useful information to readers and
writers about the precise condition of wakeup.
It may be dangerous to play with fdtab before doing fd_insert()
because the latter is responsible for growing maxfd as needed.
Call fd_insert() first instead.
Some pollers such as kqueue lose their FD across fork(), meaning that
the registered file descriptors are lost too. Now when the proxies are
started by start_proxies(), the file descriptors are not registered yet,
leaving enough time for the fork() to take place and to get a new pollfd.
It will be the first call to maintain_proxies that will register them.
select, poll and epoll now have their dedicated functions and have
been split into distinct files. Several FD manipulation primitives
have been provided with each poller.
The rest of the code needs to be cleaned to remove traces of
StaticReadEvent/StaticWriteEvent. A temporary macro-based trick
is used for now. Some work remains to be done to factorize tests
and sets everywhere.
Before calling http_parse_{sts,req}line(), it is necessary
to make msg->sol point to the beginning of the line. This
was not done, resulting in the proxy sometimes crashing when
URI rewriting or result rewriting was used.
Logs are handled better with dedicated functions. The HTTP implementation
moved to proto_http.c. It has been cleaned up a bit. Now a frontend with
option httplog and no log will not call the function anymore.
The fiprm and beprm were added to ease the transition between
a single listener mode to frontends+backends. They are no longer
needed and make the code a bit more complicated. Remove them.
Patch from Fabrice Dulaunoy. Explanation below, and script
merged in examples/.
This patch allows setting a different address in the check part for
each server (and not only a specific port).
I need this feature because I have a complex setup where, when a
specific farm goes down, I need to switch to a set of other farms
even if these other farms behave perfectly well.
For that purpose, I've made a small PERL daemon with some REGEX or
PORT tests which allow me to test a bunch of things.
Patch from Bryan Germann for 1.2.17.
In some circumstances, it is useful not to add the X-Forwarded-For
header, for instance when the client is another reverse-proxy or
stunnel running on the same machine and which already adds it. This
patch adds the "except" keyword to the "forwardfor" option, allowing
to specify an address or network which will not be added to this
header.
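For example, to avoid duplicating the header when a local stunnel has
already set it :
    option forwardfor except 127.0.0.1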
Patch from Marcus Rueckert for 1.2.17 :
"I added the attached patch to haproxy. I don't have a static uid/gid for
haproxy so i need to specify the username/groupname to run it as non
root user."
Previously, use of the "usesrc" keyword could silently fail if
either the module was not loaded, or the user did not have enough
permissions. Now the errors are better diagnosed and more appropriate
advice is given.
It was difficult to figure out how to set up the "usesrc" keyword. Now
the configuration checker is a bit more friendly: it tries to identify
most mistakes and gives some hints back.
Generally, if a recv() returns less bytes than the MSS, it means that
there is nothing left in the system's buffers, and that it's not worth
trying to read again because we are very likely to get nothing. A
default read low limit has been set to 1460 bytes below which we stop
reading.
This has brought a little speed boost on small objects while maintaining
the same speed on large objects.
Multiple read polling was temporarily disabled, which had the side
effect of burning huge amounts of CPU on large objects. It has now
been re-implemented with a limit of 8 calls per wake-up, which seems
to provide best results at least on Linux.
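The two entries above amount to a bounded, early-exiting read loop. A
hedged sketch in C (constants and function name invented, not the
actual haproxy code) :
    #include <sys/socket.h>

    #define READ_LOW_LIMIT       1460  /* assumed typical Ethernet MSS */
    #define MAX_READS_PER_WAKEUP 8

    static int drain_socket(int fd, char *buf, int size)
    {
        int total = 0, i;

        for (i = 0; i < MAX_READS_PER_WAKEUP && total < size; i++) {
            int ret = recv(fd, buf + total, size - total, 0);
            if (ret <= 0)
                break;          /* error, EAGAIN or shutdown */
            total += ret;
            if (ret < READ_LOW_LIMIT)
                break;          /* kernel buffer most likely empty */
        }
        return total;
    }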
Very recent changes consisting in moving some pointers to the
transaction instead of the session have lead to a bug because
those pointers were only initialized if the protocol was HTTP,
but they were freed based on their value. In some cases, it
was possible to cause double frees.
HTTP header matching is now made easier with http_header_match2().
Various locations have been adapted to use it. A small bug was also
fixed causing empty headers to be matched till next one.
Two new functions http_header_add_tail() and http_header_add_tail2()
make it easier to append headers, and also reduce the number of
sprintf() calls and perform stricter checks.
Some session flags were clearly related to HTTP transactions.
A new 'flags' field has been added to http_txn, and the
associated flags moved to proto_http.h.
Now the response is correctly processed in the backend first,
then in the frontend. It has undergone intensive testing to
catch regressions, and everything seems OK now, but the code
is young anyway.
Both request and response captures will have to parse headers following
the same methods. It's better to factorize the code, hence the new
capture_headers() function.
Some parts of HTTP processing were incorrectly called "request" while
they are messages or transactions. The following structure members
have changed :
http_msg.hdr_state => msg_state
http_msg.sor => som
http_req.req_state => removed
http_req => http_txn
The new rbtree-based scheduler makes heavy use of tv_cmp2(), and
this function becomes a huge CPU eater. Refine it a little bit in
order to slightly reduce CPU usage.
If captures were configured in a TCP-only listener, and
the logs were enabled, the proxy could segfault when
trying to scan the capture buffer which was NULL. Such
an erroneous configuration will not be possible anymore
soon, but let's avoid the problem for now by detecting
the NULL condition.
A missing pointer assignment in case of an empty header
will result in this header's length being 65535, causing
a SEGV when accessing the next header. It should not be
possible to exploit this problem to run arbitrary code
because the crash occurs while reading the data.
When a request is invalid during RQ_BEFORE AND the debug mode is active,
the hdr_idx might be used uninitialized. Let's initialize it right after
the accept() for now.
Since the request is no longer part of the headers, cookies and
authentication did not work anymore. Obvious fix is to add the
request offset to the start pointer.
- stats now support the HEAD method too
- extracted http request from the session
- huge rework of the HTTP parser which is now a 28-state FSM.
- linux-style likely/unlikely macros for optimization hints
- do not create a server socket when there's no server
The HTTP parser has been rewritten for better compliance to RFC2616.
The same parser is now usable for both requests and responses, and
it now supports HTTP/0.9 as well as multi-line headers. It has also
been improved for speed ; a typical HTTP request is parsed in about
2 microseconds on a 1 GHz processor.
The monitor-uri check has been moved so that the requests are not
logged. The httpclose option now tries to change as little as
possible in the request, and does not affect the first header if
it is already set to 'close'. HTTP/0.9 requests are converted to
HTTP/1.0 before being forwarded.
Headers and request transformations are now distinct. The headers
list is updated after each insertion/removal/transformation. The
request is re-parsed and checked after each transformation. It is
not possible anymore to remove a request, and requests which lead
to invalid request lines are now rejected.
Since the distinction of backends and frontends, it has become
possible that some requests reach a frontend which has no
backend parameters. We must not create a socket on the backend
side just to destroy it later in such a case. The real problem
comes from the dispatch mode not being explicitly stated.
A struct http_req has been created to collect all the information
related to an HTTP request being processed. Right now, it is
still in the struct session but the frontier is clear now.
There are browsers which sometimes send HEAD requests to the stats
page, but this case was not handled, so it returned a 503 server
error or was simply sent to the default backend servers.
Now with a HEAD request, the stats return the headers and finish
there. Normally, other methods should be blocked so that the stats
page really catches the whole URI; such methods should then cause
a 405 Method not allowed to be returned.
When a server has no port specified and there is a check
enabled on it, the check is disabled because the port is
unknown. However, people expect the "listen" line to set
the check port just like it sets the server's port. Now,
if a port is specified in the listen or in the first bind
and nowhere else, it will be used for the checks as well.
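A hedged illustration (addresses invented) : here the health checks
connect to port 8080, inherited from the listen line, since the server
line specifies no port :
    listen www 0.0.0.0:8080
        server srv1 192.168.0.10 check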
This patch from Sin Yu makes use of an rbtree for the wait queue,
which will solve the slowdown problem encountered when timeouts
are heterogeneous in the configuration. The next step will be to
turn maintain_proxies() into a per-proxy task so that we won't
have to scan them all after each poll() loop.
The tcp-splicing code has been merged, and a doc has been written.
A configuration example has been derived from the previous content
switching sample.
Some options will need some checks (or initializations) to be performed
before starting everything. The cfg_opts table has been extended to
allow storing of option-dependent checks.
Since the introduction of hdr_idx, session_free() had not
been updated to free the header index ! This implied a leak
of about 400 bytes per new session.
The stats page could not tell the difference between a FE and a BE.
It has been revamped to indicate all relevant information. The font
is also slightly smaller in order for all the info to fit into small
screens. The data output path has been greatly simplified to use
string chunks.
Most common options are now stored in an array which eases
the parsing and which also permits reporting of ignored
options depending on the proxy's capabilities (back/front).
The "httpclose" option affects both frontend and backend, so it
was logical to check for its presence at both places. A request
which traverses either a frontend or a backend with this option
set will have a "Connection: close" header appended.
The log format has been slightly updated to separately report the name
of the frontend and the name of the backend. The accept date has been
enhanced to report the millisecond. The number of remaining connections
has also been updated and their order reversed, to include the number
of connections on the frontend. The new log format is now :
- $1: IP:port
- $2: accept date in this format : [dd/mm/YYYY:HH:MM:SS.ttt]
- $3: frontend name
- $4: backend name '/' server name
- $5: req time '/' queue time '/' conn time '/' header time '/' total time
- $6: HTTP status code
- $7: number of bytes returned
- $8: captures (request)
- $9: captures (response)
- $10: completion flags
- $11: remaining conns on process '/' frontend '/' backend '/' server
- $12: srv queue size '/' backend queue size
- $13..: '"' full request '"'
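Purely as an illustration of these fields, a log line might look like
the following (every value here is invented, and the rendering of the
capture fields is assumed) :
    10.0.1.2:33317 [06/02/2007:12:14:14.655] fe-http be-static/srv1 10/0/30/69/109 200 2750 {example.host} {} ---- 115/110/10/3 0/0 "GET /index.html HTTP/1.0"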
The notion of capabilities has been added to the proxy so that we
know whether a proxy supports frontend, backend, or rulesets. Given
this, some parameters are optional, some are ignored with a warning
and others are forbidden. It is now possible to write valid two level
configs without binding to dummy address/ports.
The maxconn argument is used only for the listeners, and the
fullconn is used only for the backends. If unset, it inherits
maxconn's value which itself can inherit the default or the
global value (we might need to change this).
To solve the logging maze, it has been decided that the frontend
and nothing else will define how a session will be logged. It might
change in the future but at least this choice allows all sort of
fantasies.
It is now possible to define an errorloc in the backend as well as
in the frontend. The backend's will be used first, and if undefined,
then the frontend's will be used instead. If none is used, then the
original error messages will be used.
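For example (URL invented), a backend may declare :
    errorloc 503 http://errors.example.net/503.html
and it will take precedence over the frontend's own errorloc for that
status code.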
HTTP error messages were all specific cases handled by an IF.
Now they are all in an array so that it will be easier to add
new ones. Also, the return functions now use chunks as inputs
so that it should be easier to provide alternative return
messages if needed.
Sin Yu's patch to permit to change the proxy from a regex was merged
with little changes :
- req_cap/rsp_cap are not reassigned to the new proxy, they stay
attached to the frontend
- the actions have been renamed "reqsetbe" and "reqisetbe" for
"set BackEnd".
- the buffer is not reset after the switch, instead, the headers are
parsed again by the backend
- in Sin's patch, it was theoretically possible to switch multiple times,
but the switching track was lost, making it impossible to apply
server responses in the reverse order. Now switching is limited to
1 action (separation between frontend and backend) but the filters
remain.
Now it will be extremely easy to add other switching conditions, such
as host matching, URI matching, etc...
There is still hard work to be done on the logs and stats.
The logs have become a real mess. It is now very hard to tell which
frontend/backend will impose its configuration for the logs. This
needs a complete rework but at least it should work.
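As a hedged sketch of the new actions (names and argument layout
assumed here), a frontend could switch requests for a static host to
another backend with :
    reqisetbe ^Host:\ static\. be-static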
The nbconn attribute in the proxies was not relevant anymore because
a frontend A may use backend B and both of them must account for their
respective connections. For this reason, there now are two separate
counters for frontend and backend connections.
The stats page has been updated to reflect the backend, but a separate
line entry for the frontend with error counts would be good.
Note that as of now, beconn may be higher than maxconn, because maxconn
applies to the frontend, while beconn may be increased due to sessions
passed from another frontend.
There was a confusion about the way to find filters and backend
parameters from sessions. The chaining has been changed between
the session and the proxy.
Now, a session knows only two proxies : one frontend (->fe) and
one backend (->be). Each proxy has a link to the proxy providing
filters and to the proxy providing backend parameters (both self
by default).
The captures (cookies and headers) have been attached to the
frontend's filters for now.
The uri_auth and the statistics are attached to the backend's
filters so that the uri can depend on a hostname for instance.
A proxy will be able to borrow parameters from another one.
In particular, the filters will be inheritable from another
proxy, and the backend parameters too.
The check of uri_auth is now in a separate function which is
checked after every backend switch, so that it will be possible
to have an uri_auth for the frontend and another one for the
backend.
Some while() constructs are not very efficient with gcc, yet they are
used to scan all the text in the start line and the headers. Replacing
them with more efficient (but ugly) loops provides a global gain of
about 2%, which is not bad at all !
The filters are now iterated for FE, FI, BE.
Some grey areas remain :
- uri_auth has been propagated to the backend, but in fact it
should be checked at every level (fe, fi, be), depending
where it is declared, and before the filters.
- the HTTP method and URI should be stored and propagated everywhere
they are used. For this, we would need to first apply filters to
be aware of filter changes which affect them.
- there seems to be no need anymore for hdr_idx[0] being empty.
It may contain the start line, which will slightly improve
performance and make the code easier to read.
The req_cap, hdr_state, hdr_idx, auth_hdr and req_line have been moved
to a dedicated hreq structure in the session. It makes it easier
to add HTTP-specific fields such as SOR (start of request) and EOH (end
of headers).
It also made it possible to fix two bugs introduced by last commit :
- end of headers not correctly detected
- hdr_idx not freed upon one specific error during session creation
When the backend side will be reworked, it should rely on a similar
structure.
The code is working again, but not as clean as it could be.
Many blocks should still move to dedicated functions. req->h
must be removed everywhere and updated every time it is needed.
A few functions or macros should take care of the headers
during header insertion/deletion/change.
The new parser uses an FSM to strictly follow RFC2616.
Headers are indexed and parsed only once they're all available.
That way, complex regexes make more sense.
HTTP processing is now performed in several phases by calling
multiple functions, making the code cleaner and easier to read.
Note that req[i]pass does not work anymore because it would
require that we mark a header to be ignored. What is really
needed is to have the ability to add an exception to a matching
(match xx except yy).
Several bugs have been fixed in appsession during the conversion
to the new FSM (method length and recovery on malloc errors).
The code does build and work with the debug examples, but is
not usable yet to connect to anything as it does not forward
the requests yet.
This structure will consume 4 bytes per header to keep track of
headers within a request or a response without having to parse
the whole request for each regex. As it's not possible to allocate
only 4 bytes, we define a max number of HTTP headers. We set it
to (BUFSIZE+79)/80 so that 8kB buffers can contain 100 headers
(like Apache), resulting in 400 bytes dedicated to indexation,
or about 400/(2*8kB) ~= 2.4% of the memory usage.
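A hedged sketch of such a 4-byte entry (field layout assumed, not
copied from the source) :
    struct hdr_idx_elem {
        unsigned len  :16;   /* header length in bytes */
        unsigned cr   :1;    /* line ended with CR+LF rather than bare LF */
        unsigned next :15;   /* index of the next entry in the list */
    };                       /* 16+1+15 bits = 4 bytes */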
This patch was added in 1.2.9 but was then incidentally reverted by
a manipulation error when merging the next patch (enforce max number of
conns). It's now merged again.
The references to the proxy from the session have been turned into
Frontend (fe), Filters (fi) and Backend (be). This should ease the
migration to the L7 switching features. Next step will be to kill
the struct proxy and have 3 independent structs instead, each
referenced from entities called listener, frontend, filters and
backend.
Using the cttproxy kernel patch, it's possible to bind to any source
address. It is highly recommended to use the 03-natdel patch with the
other ones.
A new keyword appears as a complement to the "source" keyword : "usesrc".
The source address is mandatory and must be valid on the interface which
will see the packets. The "usesrc" option supports "client" (for full
client_ip:client_port spoofing), "client_ip" (for client_ip spoofing)
and any 'IP[:port]' combination to pretend to be another machine.
Right now, the source binding is missing from server health-checks if
set to another address. It must be implemented (think restricted firewalls).
The doc is still missing too.
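For example (address invented), a proxy might combine the keywords as :
    source 192.168.1.200 usesrc client
so that the server sees connections appearing to come straight from the
client's own IP and port.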
Some of the tv_* functions are called very often. Passing their
arguments as registers is quite faster. This can be disabled
by setting CONFIG_HAP_DISABLE_REGPARM.
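A hedged sketch of the idea (the exact macro definition is assumed) :
    /* On i386, gcc's regparm attribute passes the first arguments
     * in registers instead of on the stack. */
    #if defined(__i386__) && !defined(CONFIG_HAP_DISABLE_REGPARM)
    #define REGPRM2 __attribute__((regparm(2)))
    #else
    #define REGPRM2
    #endif

    REGPRM2 static unsigned tv_min_example(unsigned a, unsigned b)
    {
        return a < b ? a : b;
    }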
As suggested by Markus Elfring, a few "const char *" have replaced
some "char *" declarations where a function is not expected to
modify a value. It does not change the code but it helps detecting
coding errors.
Markus Elfring suggested adding a few checks which were missing
after a bunch of getsockopt() and 2 strdup(). While those are
unlikely to fail where they are used, it makes the code cleaner.
Since the connection queueing was introduced, the "redispatch"
option could not cover the cases where a connection has been
refused by the server after having been marked "in progress".
The fix consists in doing a redispatch in the delayed connection
handling code.
Problem reported by Konrad Rzentarzewski.
It is now possible to tarpit connections based on regex matches.
The tarpit timeout is equal to the contimeout. A 500 server error
response is faked, and the logs show the status flags as "PT" which
indicate the connection has been tarpitted.
The timeouts, expiration timers and results are now stored in the buffers.
The timers will have to change a bit to become more flexible, and when the
I/O completion functions will be written, the connect_complete() will have
to be extracted from the write() function.
This makes it possible to relay SSL connections in pure TCP instances while
ensuring the remote end really receives our data even though intermediate
agents (firewalls, proxies, ...) might acknowledge the connection.
Too many problem reports were caused by missing timeouts. While
there has never been any default value since version 1.0, having
no timeout is abnormal in networked environments, and will lead
to various problems such as CLOSE_WAIT sockets accumulating and
nasty things like this. For this reason, it's better to annoy
the users until they fix their configs than letting them run
buggy configurations.
The files are now stored under :
- include/haproxy for the generic includes
- include/types.h for the structures needed within prototypes
- include/proto.h for function prototypes and inline functions
- src/*.c for the C files
Most include files are now covered by LGPL. A last move still needs
to be done to put inline functions under GPL and not LGPL.
Version has been set to 1.3.0 in the code but some control still
needs to be done before releasing.
It has been rewritten and now supports an initialization state. It now also
prevents dumping stopped (disabled) listeners, and it is possible to
specify a scope with a list of proxies that are allowed to be dumped from
the one being configured ('.' meaning "this one"). The 'stats' entry can
be configured from the 'defaults' instance and it is correctly flushed
from proxies which redefine it.
Right now it only validates the user/passwd according to a specified list,
and lets the user pass through the proxy if the authentication is OK, and
it refuses any invalid access with a 401 Unauthorized response.
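Taken together, these entries suggest a configuration sketch such as
(names and credentials invented) :
    listen webfarm 0.0.0.0:8080
        stats uri   /stats
        stats scope .
        stats auth  admin:secret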
* merged Alexander Lazic's and Klaus Wagner's work on application
cookie-based persistence. Since this is the first merge, this version is
not intended for general use and reports are more than welcome. Some
documentation is really needed though.