You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org
There are several services that are load-balanced behind a proxy with HAProxy, at this point mostly databases. HAProxy is a mature, open-source, simple, not-crazily-full-of-features, but reliable and efficient ("just does the job") L3/L4 (tcp/ip) proxy, allowing to be used for both load balancing and automatic switchover.
However, it also provides [some] L7 functionalities for HTTP/HTTPS, and some L7 simple checks for things like MySQL. When more complex checks are needed, typical deployments setup an open ip socket that responds an HTTP code to control the proxy behaviour, for example: https://www.percona.com/doc/percona-xtradb-cluster/5.7/howtos/haproxy.html
Find out if a proxy is being actively used
Not all services are using a proxy as of today (24th Sept 2019), even though they have one ready to be used. To find out which proxies are in use you can run:
host m1-master ; host m2-master ; host m3-master ; host labsdb-analytics ; host labsdb-web
If the result points to a dbproxy, it means it is in use. If it points to a database, it means there is no proxy being used in front of it.
The typical server is configured like this (
/etc/haproxy/conf.d/*). Here it knows about a primary (DB master) a secondary (DB replication slave) but only one node is active at any timeː
listen mariadb 0.0.0.0:3306 mode tcp balance roundrobin option tcpka option mysql-check user haproxy server <%= @primary_name %> <%= @primary_addr %> check inter 3s fall 3 rise 99999999 server <%= @secondary_name %> <%= @secondary_addr %> check backup
If the primary fails health checks the backup is brought online. The rise 99999999 trick (about 10 years) means that the primary does not come back without human intervention, even if it starts passing HAProxy health checks again. This prevents flopping back and forth once a failure happens, something worse than just losing a node.
Now, this all sounds good, but there are still some catches:
- At present many misc slaves are still running read_only=1 so read traffic will fail over but writes will start to be blocked until a human verifies that the old master is properly dead and runs SET GLOBAL read_only=0;. Applications on m2 like gerrit, ieg, otrs, exim and scholarships will complain but remain semi-useful.
- Persistent connections like those from the eventlogging, bacula, phabricator, etc. did not failover nicely during trials, instead hitting a TCP timeout and causing just about as much annoyance (and backfilling) as having no HAProxy at all. This needs more research.
When a dbproxy complains, it will give a non-critical (will not page) error with:
<icinga-wm> PROBLEM - haproxy failover on dbproxy1010 is CRITICAL: CRITICAL check_failover servers up 1 down 1
You can check the status of a proxy by running as root from localhost:
$ echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio # pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime, mariadb,FRONTEND,,,0,2,5000,361,60844,537174,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,1,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,, mariadb,labsdb1009,0,0,0,2,,226,40594,501669,,0,,0,0,0,0,DOWN,1,1,0,3,1,138,138,,1,2,1,,226,,2,0,,1,L4TOUT,,3000,,,,,,,0,,,,2,0,,,,,138,,,0,0,0,7658, mariadb,labsdb1010,0,0,0,1,,135,20250,35505,,0,,0,0,0,0,UP,1,0,1,0,0,1543670,0,,1,2,2,,135,,2,1,,1,L7OK,0,0,,,,,,,0,,,,0,0,,,,,1,5.5.5-10.1.19-MariaDB,,0,0,0,1, mariadb,BACKEND,0,0,0,2,500,361,60844,537174,0,0,,0,0,0,0,UP,1,0,1,,0,1543670,0,,1,2,0,,361,,1,1,,1,,,,,,,,,,,,,,2,0,0,0,0,0,1,,,0,0,0,5882,
Here we see that labsdb1009 has gone down, and labsdb1010 is now the server serving the proxy backend, which is still up.
So for the present, if a dbproxyXXX complains:
- Check that original master is really down. If not, restart haproxy on dbproxy1002 and figure out why health checks failed.
- If the master is fubar ensure its mysqld is stopped before setting read_only=0 on the slave.
- If the slave is fubar most apps probably don't care, so do nothing.
if you run
sudo systemctl reload haproxy, because the original server has recovered, you will now get:
$ echo "show stat" | socat unix-connect:/run/haproxy/haproxy.sock stdio # pxname,svname,qcur,qmax,scur,smax,slim,stot,bin,bout,dreq,dresp,ereq,econ,eresp,wretr,wredis,status,weight,act,bck,chkfail,chkdown,lastchg,downtime,qlimit,pid,iid,sid,throttle,lbtot,tracked,type,rate,rate_lim,rate_max,check_status,check_code,check_duration,hrsp_1xx,hrsp_2xx,hrsp_3xx,hrsp_4xx,hrsp_5xx,hrsp_other,hanafail,req_rate,req_rate_max,req_tot,cli_abrt,srv_abrt,comp_in,comp_out,comp_byp,comp_rsp,lastsess,last_chk,last_agt,qtime,ctime,rtime,ttime, mariadb,FRONTEND,,,0,1,5000,3,450,789,0,0,0,,,,,OPEN,,,,,,,,,1,2,0,,,,0,1,0,1,,,,,,,,,,,0,0,0,,,0,0,0,0,,,,,,,, mariadb,labsdb1009,0,0,0,1,,3,450,789,,0,,0,0,0,0,UP,1,1,0,0,0,3,0,,1,2,1,,3,,2,1,,1,L7OK,0,0,,,,,,,0,,,,0,0,,,,,1,5.5.5-10.1.19-MariaDB,,0,0,0,1, mariadb,labsdb1010,0,0,0,0,,0,0,0,,0,,0,0,0,0,UP,1,0,1,0,0,3,0,,1,2,2,,0,,2,0,,0,L7OK,0,0,,,,,,,0,,,,0,0,,,,,-1,5.5.5-10.1.19-MariaDB,,0,0,0,0, mariadb,BACKEND,0,0,0,1,500,3,450,789,0,0,,0,0,0,0,UP,1,1,1,,0,3,0,,1,2,0,,3,,1,1,,1,,,,,,,,,,,,,,0,0,0,0,0,0,1,,,0,0,0,1,
<icinga-wm> RECOVERY - haproxy failover on dbproxy1010 is OK: OK check_failover
(all servers are back)
Two annoying particularities of haproxy:
- HAProxy, as of this date, doesn't automatically read all configuration files inside `/etc/haproxy/conf.d`, this is workarounded by a preexecution systemd script (
generate_haproxy_default.shthat reads all files present there and generates a manual command line on
/etc/default/haproxy. This can be misleading if you delete configuration files on puppet but not physically.
- HAProxy is able to reload (and reset its status) in a clean (and recommended way) by using
reload. However, if you change the config file names or its number, you need to
restartthe service (which is fast, but drops connections. This is a limitation of HAProxy itself and not our puppetization.
Other interesting commands
$ echo "show info" | socat unix-connect:/run/haproxy/haproxy.sock stdio Name: HAProxy Version: 1.5.8 Release_date: 2014/10/31 Nbproc: 1 Process_num: 1 Pid: 25297 Uptime: 0d 0h59m29s Uptime_sec: 3569 Memmax_MB: 0 Ulimit-n: 4033 Maxsock: 4033 Maxconn: 2000 Hard_maxconn: 2000 CurrConns: 0 CumConns: 330 CumReq: 330 MaxSslConns: 0 CurrSslConns: 0 CumSslConns: 0 Maxpipes: 0 PipesUsed: 0 PipesFree: 0 ConnRate: 0 ConnRateLimit: 0 MaxConnRate: 1 SessRate: 0 SessRateLimit: 0 MaxSessRate: 1 SslRate: 0 SslRateLimit: 0 MaxSslRate: 0 SslFrontendKeyRate: 0 SslFrontendMaxKeyRate: 0 SslFrontendSessionReuse_pct: 0 SslBackendKeyRate: 0 SslBackendMaxKeyRate: 0 SslCacheLookups: 0 SslCacheMisses: 0 CompressBpsIn: 0 CompressBpsOut: 0 CompressBpsRateLim: 0 ZlibMemUsage: 0 MaxZlibMemUsage: 0 Tasks: 6 Run_queue: 1 Idle_pct: 100 node: dbproxy1010 description: