Bits.wikimedia.org/Varnish testing

A few notes:

  • We hit a bug where all threads write to the acceptor pipe, but the acceptor thread does not seem to pick the writes up
  • We originally suspected a 2.6.24 kernel problem, so a 2.6.32.3 kernel was deployed, but the problem persisted
  • This happens with both the poll and epoll acceptors (we hit it much earlier with the poll acceptor, which may be a coincidence)
  • Currently we are running:
    • Varnish 2.0 branch, Feb 15 build (standard configure options)
    • The following additional sysctl.conf changes loaded:
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_fin_timeout = 3
net.core.netdev_max_backlog = 30000
net.ipv4.tcp_no_metrics_save = 1
net.core.somaxconn = 262144
net.ipv4.tcp_syncookies = 0
net.ipv4.tcp_max_orphans = 262144
net.ipv4.tcp_max_syn_backlog = 262144
net.ipv4.tcp_synack_retries = 2
net.ipv4.tcp_syn_retries = 2
    • ulimit -s 128
    • ulimit -n 500000
    • varnishd -n /dev/shm -smalloc,1G -f /usr/local/etc/varnish/bits.vcl -T 127.0.0.1:6000 -w 2000 -a 0.0.0.0:80 -p thread_pool_add_delay=1 -p send_timeout=30 -p listen_depth=4096
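
To confirm that the sysctl values listed above are the ones the kernel is actually using, something like the following sketch could be run on the host. It is not part of the original setup: the function names (parse_sysctl, proc_path, find_mismatches) are made up for illustration, and the WANTED sample holds only three of the settings. It relies on the standard Linux mapping of a sysctl key such as net.core.somaxconn to the file /proc/sys/net/core/somaxconn.

```python
from pathlib import Path


def parse_sysctl(text):
    """Parse 'key = value' lines into a dict, skipping blanks and comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not a key/value line
        # Collapse runs of whitespace so "4096\t87380" and "4096 87380" compare equal.
        settings[key.strip()] = " ".join(value.split())
    return settings


def proc_path(key, root="/proc/sys"):
    """Map a sysctl key to its file under /proc/sys (dots become slashes)."""
    return Path(root) / key.replace(".", "/")


def find_mismatches(settings, root="/proc/sys"):
    """Return {key: (wanted, actual)} for every value that differs or is unreadable."""
    mismatches = {}
    for key, wanted in settings.items():
        try:
            actual = " ".join(proc_path(key, root).read_text().split())
        except OSError:
            actual = None  # key missing or not readable on this kernel
        if actual != wanted:
            mismatches[key] = (wanted, actual)
    return mismatches


# Hypothetical sample: a subset of the settings listed above.
WANTED = """
net.core.somaxconn = 262144
net.ipv4.tcp_rmem=4096 87380 16777216
net.ipv4.tcp_fin_timeout = 3
"""

if __name__ == "__main__":
    for key, (wanted, actual) in find_mismatches(parse_sysctl(WANTED)).items():
        print(f"{key}: wanted {wanted!r}, kernel reports {actual!r}")
```

An empty output would mean every listed key matches; a reported value of None means the key could not be read at all.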

Apparently, when Varnish reaches 100% CPU usage, the accept thread gets into a state where it leaks worker threads. The poll() acceptor reaches 100% CPU sooner than epoll, so the effect becomes visible much earlier. Because sq1 is single-CPU/single-core, a high rate of context switches consumes far more CPU there than it would on a multi-core machine, which is why we saw much better scalability on multi-core machines.
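
For readers unfamiliar with the pattern behind the "acceptor pipe" in the notes above, here is a minimal sketch of a pipe hand-off between worker threads and a single poll()-based reader. It is a generic illustration only, not Varnish's code: run_demo, the 8-byte record layout and the thread counts are all assumptions made for the example. The reported bug corresponds to the reader loop never draining what the workers wrote.

```python
import os
import select
import struct
import threading

RECORD = struct.Struct("II")  # (worker_id, sequence_number), 8 bytes


def run_demo(n_workers=4, items_per_worker=3, timeout_ms=1000):
    """Workers write fixed-size records to a pipe; one thread polls and drains it."""
    read_fd, write_fd = os.pipe()

    def worker(worker_id):
        for seq in range(items_per_worker):
            # Writes of at most PIPE_BUF bytes to a pipe are atomic under POSIX,
            # so records from different workers never interleave.
            os.write(write_fd, RECORD.pack(worker_id, seq))

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_workers)]
    for t in threads:
        t.start()

    poller = select.poll()
    poller.register(read_fd, select.POLLIN)

    expected = n_workers * items_per_worker
    received = []
    buf = b""
    while len(received) < expected:
        if not poller.poll(timeout_ms):
            break  # nothing arrived in time; a stuck reader would look like this
        buf += os.read(read_fd, 4096)
        while len(buf) >= RECORD.size:
            received.append(RECORD.unpack(buf[:RECORD.size]))
            buf = buf[RECORD.size:]

    for t in threads:
        t.join()
    os.close(read_fd)
    os.close(write_fd)
    return received


if __name__ == "__main__":
    print(sorted(run_demo()))
```

The same loop can be written with select.epoll on Linux; only the readiness call changes, while the pipe and the drain logic stay the same.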

Currently bits.pmtpa is served by db19, which handles 14,000 requests/s at 180% CPU load and ~100 MByte/s of traffic.
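
The db19 figures can be turned into a few derived numbers. This is back-of-the-envelope arithmetic under two assumptions that the notes do not state: that MByte means 10^6 bytes, and that 180% CPU load means 1.8 cores' worth of CPU time in the usual top-style accounting.

```python
REQUESTS_PER_SECOND = 14_000
TRAFFIC_BYTES_PER_SECOND = 100 * 10**6  # assumption: 1 MByte = 10^6 bytes
CPU_LOAD_PERCENT = 180                  # assumption: 100% = one fully busy core


def avg_bytes_per_request(bytes_per_s, requests_per_s):
    """Average bytes on the wire per request."""
    return bytes_per_s / requests_per_s


def megabits_per_second(bytes_per_s):
    """Convert a byte rate to Mbit/s (decimal megabits)."""
    return bytes_per_s * 8 / 10**6


def requests_per_busy_core(requests_per_s, cpu_percent):
    """Requests/s handled per fully used core-equivalent."""
    return requests_per_s / (cpu_percent / 100)


if __name__ == "__main__":
    print(round(avg_bytes_per_request(TRAFFIC_BYTES_PER_SECOND, REQUESTS_PER_SECOND)))
    print(round(megabits_per_second(TRAFFIC_BYTES_PER_SECOND)))
    print(round(requests_per_busy_core(REQUESTS_PER_SECOND, CPU_LOAD_PERCENT)))
```

Under those assumptions this works out to roughly 7 kB per request, about 800 Mbit/s on the wire, and a little under 8,000 requests/s per busy core.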

db19 is currently connected directly to the core switch via bond0 over eth0/eth1.

VCL:

backend default {
    .host = "10.2.1.1";
    .port = "80";
}

sub vcl_recv {
    if (req.request != "GET" && req.request != "HEAD") {
        /* We only deal with GET and HEAD by default */
        error 403 "this is readonly domain";
    }
    if (req.http.host != "bits.wikimedia.org") {
        error 403 "bad bad, very bad request";
    }
    return (lookup);
}
sub vcl_error {
    set obj.http.Content-Type = "text/html; charset=utf-8";
    synthetic {"
<!DOCTYPE html>
<html>
  <head>
    <title>"} obj.status " " obj.response {"</title>
  </head>
  <body>
    <h1>Error "} obj.status " " obj.response {"</h1>
    <p>We didn't feel like serving it</p>
  </body>
</html>
"};
    return (deliver);
}
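
The request filter in vcl_recv above can be modelled as a small pure function, which is handy for checking which requests would be refused before touching the live config. This is only a sketch of the decision logic, not something Varnish runs. The name recv_decision is made up for the example, and it assumes that the VCL string comparisons are exact and case-sensitive and that a missing Host header counts as a mismatch.

```python
ALLOWED_METHODS = {"GET", "HEAD"}
EXPECTED_HOST = "bits.wikimedia.org"


def recv_decision(method, host):
    """Mirror the vcl_recv checks: return (status, message) on error, else 'lookup'."""
    if method not in ALLOWED_METHODS:
        # Same as: req.request != "GET" && req.request != "HEAD"
        return (403, "this is readonly domain")
    if host != EXPECTED_HOST:
        # host is None when the request carries no Host header (assumed mismatch)
        return (403, "bad bad, very bad request")
    return "lookup"


if __name__ == "__main__":
    cases = [
        ("GET", "bits.wikimedia.org"),
        ("HEAD", "bits.wikimedia.org"),
        ("POST", "bits.wikimedia.org"),
        ("GET", "en.wikipedia.org"),
        ("GET", "bits.wikimedia.org:80"),
        ("GET", None),
    ]
    for method, host in cases:
        print(method, host, "->", recv_decision(method, host))
```

Note that under the exact-match assumption a Host header carrying an explicit port, such as bits.wikimedia.org:80, is also refused.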