HTTP timeouts

Revision as of 10:34, 21 January 2020 by imported>Vgutierrez (Add connect timeout on ats-tls && ats-backend)
This page attempts to document the timeouts involved in a request made by a user against a service behind the WMF caching layer.

The entry point for a user is either nginx or ats-tls, depending on the service and the cache node assigned to the user's IP address:

| TLS termination layer | SSL handshake timeout | Connect timeout (origin server) | TTFB (origin server) | Successive reads (origin server) | Keepalive timeout (client) |
|---|---|---|---|---|---|
| nginx (deprecated) | 60 seconds (nginx default value) | 10 seconds (nginx default value) | 180 seconds | 180 seconds (same config parameter as TTFB) | 60 seconds |
| ats-tls | 60 seconds | 3 seconds | 180 seconds | 180 seconds | 120 seconds |

Currently, a major difference between nginx and ats-tls is how they handle POST requests. nginx buffers the whole request before relaying it to the origin (varnish-frontend), while ats-tls does not buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout for receiving the POST body is 60 seconds between read operations; this is the default value and is not explicitly configured.
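Because the nginx limit applies between read operations rather than to the whole body, a slow upload can succeed as long as no single gap exceeds 60 seconds. The following is a minimal sketch of that rule, not nginx's actual implementation. The function name and the list of chunk arrival times are hypothetical, and the setting involved is presumably nginx's `client_body_timeout`, although this page does not name it.

```python
# Simplified model of a "timeout between successive reads".
# 60 seconds is the nginx default quoted on this page.
READ_TIMEOUT_SECONDS = 60


def body_read_times_out(chunk_arrival_times, timeout=READ_TIMEOUT_SECONDS):
    """Return True if any gap between successive body reads exceeds `timeout`.

    chunk_arrival_times: ascending list of seconds, measured from the moment
    the server starts waiting for the request body (hypothetical input).
    """
    previous = 0.0
    for arrival in chunk_arrival_times:
        if arrival - previous > timeout:
            return True  # the gap before this chunk was too long
        previous = arrival
    return False


# 150 seconds in total, but never more than 50 seconds between reads.
print(body_read_times_out([50, 100, 150]))  # False
# Only 75 seconds in total, but one gap of 65 seconds.
print(body_read_times_out([10, 75]))        # True
```

The model ignores what happens when a gap is exactly equal to the timeout and says nothing about ats-tls, which relays the body without buffering it.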

Our caching system is split into two layers (frontend and backend). There is one implementation of the frontend layer (varnish) and two implementations of the backend layer (varnish-be and ats-be).

| Caching layer | Connect timeout | TTFB | Successive reads |
|---|---|---|---|
| varnish-frontend | 3 seconds (text) / 5 seconds (upload) | 65 seconds (text) / 35 seconds (upload) | 33 seconds (text) / 60 seconds (upload) |
| ats-backend | 10 seconds | 180 seconds | 180 seconds |
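To see which layer is likely to give up first while waiting for the first byte, the TTFB values from the tables above can be compared along a request path. This is only a sketch under stated assumptions: each layer enforces its TTFB timeout independently, the single ats-tls and ats-backend values apply to both the text and upload clusters, and cache hits and retries are ignored. The dictionary and function names are hypothetical.

```python
# TTFB timeouts in seconds, copied from the tables on this page.
# Assumption: ats-tls and ats-backend use the same value for text and upload.
TTFB_SECONDS = {
    "ats-tls": {"text": 180, "upload": 180},
    "varnish-frontend": {"text": 65, "upload": 35},
    "ats-backend": {"text": 180, "upload": 180},
}

# One possible request path through the caching layers.
PATH = ["ats-tls", "varnish-frontend", "ats-backend"]


def tightest_ttfb(path, cluster):
    """Return (layer, seconds) for the smallest TTFB timeout on the path."""
    layer = min(path, key=lambda name: TTFB_SECONDS[name][cluster])
    return layer, TTFB_SECONDS[layer][cluster]


print(tightest_ttfb(PATH, "text"))    # ('varnish-frontend', 65)
print(tightest_ttfb(PATH, "upload"))  # ('varnish-frontend', 35)
```

Under these assumptions, varnish-frontend has the shortest TTFB budget on both clusters, so a slow origin response would be cut off there before the 180-second limits come into play.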

After leaving the backend caching layer, the request reaches the appserver. The following timeouts apply to the appservers and api clusters:

| Layer | Request timeout |
|---|---|
| Nginx (TLS/ats-be requests) | N/A (same timeouts as the nginx used for TLS termination) |
| Envoy (TLS/ats-be requests) | 1 second (connect timeout) / 65 seconds (route timeout) |
| Apache | 202 seconds |
| PHP | 201 seconds (appservers) / 201 seconds (api) |
| Excimer | 60 seconds (GET) / 200 seconds (POST) |

Note: These timeouts may be larger than the ones on the caching layer, mainly so that internal clients are served properly.
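One way to read the appserver values is that each layer is given slightly more time than the layer it wraps (Excimer 200 seconds for POST, PHP 201, Apache 202), so the innermost limit would fire first. The page does not state this intent, so the sketch below treats it as an assumption and only checks the ordering. The list and function names are hypothetical.

```python
# Request timeouts in seconds, copied from the appserver table above.
# Assumed order: innermost (closest to the PHP code) to outermost.
APPSERVER_TIMEOUTS = [
    ("Excimer (POST)", 200),
    ("PHP", 201),
    ("Apache", 202),
]


def is_strictly_nested(timeouts):
    """Return True if every layer times out before the layer that wraps it."""
    seconds = [value for _, value in timeouts]
    return all(inner < outer for inner, outer in zip(seconds, seconds[1:]))


def first_to_fire(timeouts):
    """Return the (layer, seconds) pair with the smallest timeout."""
    return min(timeouts, key=lambda item: item[1])


print(is_strictly_nested(APPSERVER_TIMEOUTS))  # True
print(first_to_fire(APPSERVER_TIMEOUTS))       # ('Excimer (POST)', 200)
```

For GET requests the Excimer limit is 60 seconds, so the same ordering holds with a wider margin.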