
HTTP timeouts: Difference between revisions

imported>Vgutierrez
m (remove mention to varnish-be)
imported>Krinkle
Line 1: Line 1:
This page is an attempt of documenting the timeouts involved in a request performed by a user against a service behind WMF caching layer.
{{Navigation Wikimedia infrastructure|expand=caching}}
This documents '''HTTP timeouts''' involved in web requests from users to a service behind WMF traffic layers.


{{TOC|limit=2|clear=none}}


The entry point for a user could be nginx or ats-tls depending on the service and the cache node assigned to the user IP:
== TLS ==
 
The entry point for a user is ats-tls, which node depends on the service and user IP address:
{| class="wikitable"
{| class="wikitable"
|+
|+
Line 11: Line 15:
!successive reads (origin server)
!successive reads (origin server)
!Keepalive timeout (client)
!Keepalive timeout (client)
|-
|ats-tls
|[https://github.com/wikimedia/puppet/blob/91c1a976955b0b8e16d808aa2371f3f66c1e8f3e/hieradata/common/profile/trafficserver/tls.yaml#L33 60 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L140 3 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L145 180 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L145 180 seconds]
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/common/profile/trafficserver/tls.yaml#L140 120 seconds]
|-
|-
|nginx (deprecated)
|nginx (deprecated)
Line 18: Line 29:
|180 seconds (same config parameter as TTFB)
|180 seconds (same config parameter as TTFB)
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/modules/tlsproxy/manifests/localssl.pp#L102 60 seconds]
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/modules/tlsproxy/manifests/localssl.pp#L102 60 seconds]
|-
|ats-tls
|[https://github.com/wikimedia/puppet/blob/91c1a976955b0b8e16d808aa2371f3f66c1e8f3e/hieradata/common/profile/trafficserver/tls.yaml#L33 60 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L140 3 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L145 180 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/tls.yaml#L145 180 seconds]
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/common/profile/trafficserver/tls.yaml#L140 120 seconds]
|}
|}
Currently a big difference between nginx and ats-tls can be found on how they handle POST requests. nginx buffers the whole request completely before relaying it to the origin (varnish-frontend) while ats-tls doesn't buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout to fulfil the POST body is 60 seconds between read operations, this is the default value and it isn't explicitly configured.
Currently a big difference between nginx and ats-tls can be found on how they handle POST requests. nginx buffers the whole request completely before relaying it to the origin (varnish-frontend) while ats-tls doesn't buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout to fulfil the POST body is 60 seconds between read operations, this is the default value and it isn't explicitly configured.


Our caching system is split in two layers (frontend and backend). There is one implementation of the frontend layer (varnish) and one implementation of the backend layer (ats-be).
== Caching ==
 
Our caching system is split in two layers (frontend, and backend). There is one implementation of the frontend layer (varnish) and one implementation of the backend layer (ats-be).
 
{| class="wikitable"
{| class="wikitable"
|+
|+
Line 37: Line 44:
|-
|-
|varnish-frontend
|varnish-frontend
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/text.yaml#L413 3 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L33 5 seconds]<sup>upload</sup>
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/text.yaml#L413 3 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L33 5 seconds] <sup>(upload)</sup>
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L414 65 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L34 35 seconds]<sup>upload</sup>
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L414 65 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L34 35 seconds] <sup>(upload)</sup>
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L415 33 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L35 60 seconds]<sup>upload</sup>
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L415 33 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L35 60 seconds] <sup>(upload)</sup>
|-
|-
|ats-backend
|ats-backend
Line 46: Line 53:
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/backend.yaml#L393 180 seconds]
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/backend.yaml#L393 180 seconds]
|}
|}
== App server ==
{{see also|MediaWiki at WMF#Timeouts}}
After leaving the backend caching layer, the request reaches the appserver. Here are described the timeouts that apply to appservers and api:
After leaving the backend caching layer, the request reaches the appserver. Here are described the timeouts that apply to appservers and api:
{| class="wikitable"
{| class="wikitable"
|+
|+ As of March 2020
!layer
!layer
!request timeout
!request timeout
|-
|-
|Nginx (TLS/ats-be requests)
|Nginx (TLS)
|N/A (same timeouts as the nginx used for TLS termination)
|180 seconds <sup>(appserver, api, parsoid)</sup> / 1200 seconds <sup>(jobrunner)</sup> / 86400 seconds <sup>(videoscaler)</sup>.
Configured by <code>proxy_read_timeout</code>. Time to first byte. Wall clock time.
|-
|-
|Envoy(TLS/ats-be requests)
|Envoy (TLS/ats-be requests)
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/envoyproxy/manifests/tls_terminator.pp#L68 1 second]<sup>connect timeout</sup> / [https://github.com/wikimedia/puppet/blob/production/modules/envoyproxy/manifests/tls_terminator.pp#L69 65 seconds]<sup>route timeout</sup>
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/envoyproxy/manifests/tls_terminator.pp#L68 1 second] <sup>(connect timeout)</sup> / [https://github.com/wikimedia/puppet/blob/production/modules/envoyproxy/manifests/tls_terminator.pp#L69 65 seconds] <sup>(route timeout)</sup>
{{Outdated-inline}}
|-
|-
|Apache
|Apache
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/mediawiki/templates/apache/apache2.conf.erb#L4 202 seconds]
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/mediawiki/templates/apache/apache2.conf.erb#L4 202 seconds] <sup>(appserver, api, parsoid)</sup> / 1202 seconds <sup>(jobrunner)</sup> / 86402 seconds <sup>(videoscaler)</sup>.
Configured by <code>Timeout</code>. Entire request-response, including connection time. Wall clock time.
|-
|php-fpm
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver.yaml#L32 201 seconds] <sup>(appservers)</sup> / [https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver/api.yaml#L32 201 seconds] <sup>(api)</sup> / 201 seconds <sup>(parsoid)</sup> / 86400 seconds <sup>(jobrunner, videoscaler)</sup>.
Configured by <code>profile::mediawiki::php::request_timeout</code>. Wall clock time.
|-
|-
|PHP
|PHP
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver.yaml#L32 201 seconds]<sup>appservers</sup> / [https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver/api.yaml#L32 201 seconds]<sup>api</sup>
|210 seconds <sup>(appserver, api, parsoid)</sup> / 1200 seconds <sup>(jobrunner, videoscaler)</sup>.
Configured by <code>max_execution_time</code>. CPU time (not including syscalls and C functions from extensions).
|-
|-
|Excimer
|MediaWiki
|[https://github.com/wikimedia/operations-mediawiki-config/blob/dd2f06c71e82cef6a24c8325ede80c4847085f61/wmf-config/set-time-limit.php#L31 60 seconds]<sup>GET</sup> / [https://github.com/wikimedia/operations-mediawiki-config/blob/dd2f06c71e82cef6a24c8325ede80c4847085f61/wmf-config/set-time-limit.php#L29 200 seconds]<sup>POST</sup>
|60 seconds <sup>(GET)</sup> / 200 seconds <sup>(POST)</sup> / 200 seconds <sup>(jobrunner)</sup> / 86400 seconds <sup>(videoscaler)</sup>.
This is configured [https://github.com/wikimedia/operations-mediawiki-config/blob/HEAD/wmf-config/set-time-limit.php#L14 using php-excimer]
|}
|}
'''Note:''' Those timeouts might be larger than the ones on the caching layer, mainly to properly service internal clients
 
=== Notes ===
 
The app server timeouts might be larger than the ones on the caching layer, this is mainly to properly service internal clients.
 
; php-fpm
: The <code>request_timeout</code> setting is the maximum time php-fpm will spend processing a request before terminating the worker process. This exists as a last resort to kill PHP processes even if a long-running C function is not yielding to Excimer and/or if PHP raised <code>max_execution_time</code> at run-time.
; PHP
: The <code>max_execution_time</code> setting in php.ini measures CPU time (not wall clock time), and does not include syscalls.
: Note that unlike all other settings, for videoscalers this setting is far lower than the higher-level timeouts (20min vs 24h). This is a compromise to prevent regular jobs from being able to spend 24h on the CPU, which would be very unexpected (as they share the same php-fpm configuration). Videoscaling jobs are expected to spend most of their time transcoding videos, which happens through syscalls so this is fine.
; MediaWiki
: This is controlled by the <code>ExcimerTimer</code> interval value, in [https://github.com/wikimedia/operations-mediawiki-config/blob/HEAD/wmf-config/set-time-limit.php#L14 wmf-config/set-time-limit]. Upon reaching the timeout, [[mw:Excimer|php-excimer]] will throw a <code>WMFTimeoutException</code> exception once the current syscall returns.

Revision as of 01:35, 15 May 2020

This documents HTTP timeouts involved in web requests from users to a service behind WMF traffic layers.

TLS

The entry point for a user is ats-tls; which node handles the request depends on the service and the user's IP address:

{| class="wikitable"
!TLS termination layer
!SSL handshake timeout
!connect timeout (origin server)
!TTFB (origin server)
!successive reads (origin server)
!Keepalive timeout (client)
|-
|ats-tls
|60 seconds
|3 seconds
|180 seconds
|180 seconds
|120 seconds
|-
|nginx (deprecated)
|60 seconds (nginx default value)
|10 seconds (nginx default value)
|180 seconds
|180 seconds (same config parameter as TTFB)
|60 seconds
|}

A major current difference between nginx and ats-tls is how they handle POST requests. nginx buffers the entire request before relaying it to the origin (varnish-frontend), whereas ats-tls does not buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout for receiving the POST body is 60 seconds between read operations; this is the nginx default value and is not explicitly configured.
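
A timeout "between read operations" restarts after every successful read, so it bounds the idle gap between reads rather than the total upload time. The following is a minimal Python sketch of that semantics under a simplified model in which the body arrives as chunks separated by idle gaps. The function name, the gap values, and the choice to treat a gap exactly equal to the limit as acceptable are assumptions for illustration; none of this is taken from the nginx or ats-tls implementation.

```python
from typing import Optional, Sequence


def first_read_timeout(idle_gaps: Sequence[float], read_timeout: float) -> Optional[int]:
    """Return the index of the first idle gap that exceeds read_timeout.

    Returns None when every gap stays within the limit, however long the
    whole transfer takes in total.
    """
    for index, gap in enumerate(idle_gaps):
        if gap > read_timeout:
            return index
    return None


# Hypothetical slow client: one chunk every 50 seconds, 250 seconds in total.
slow_but_steady = [50.0] * 5
# Hypothetical client that stalls once for 61 seconds.
stalled_once = [1.0, 61.0, 1.0]

print(first_read_timeout(slow_but_steady, 60.0))
print(first_read_timeout(stalled_once, 60.0))
```

In this model the steady client is never cut off by a 60-second limit even though its upload lasts far longer than 60 seconds, while the client that stalls once is cut off at its second gap.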

Caching

Our caching system is split into two layers (frontend and backend). There is one implementation of the frontend layer (varnish) and one implementation of the backend layer (ats-be).

{| class="wikitable"
!caching layer
!connect timeout
!TTFB
!successive reads
|-
|varnish-frontend
|3 seconds (text) / 5 seconds (upload)
|65 seconds (text) / 35 seconds (upload)
|33 seconds (text) / 60 seconds (upload)
|-
|ats-backend
|10 seconds
|180 seconds
|180 seconds
|}

App server

After leaving the backend caching layer, the request reaches the appserver. The following timeouts apply to appservers and api:

{| class="wikitable"
|+ As of March 2020
!layer
!request timeout
|-
|Nginx (TLS)
|180 seconds (appserver, api, parsoid) / 1200 seconds (jobrunner) / 86400 seconds (videoscaler).
Configured by <code>proxy_read_timeout</code>. Time to first byte. Wall clock time.
|-
|Envoy (TLS/ats-be requests)
|1 second (connect timeout) / 65 seconds (route timeout)
|-
|Apache
|202 seconds (appserver, api, parsoid) / 1202 seconds (jobrunner) / 86402 seconds (videoscaler).
Configured by <code>Timeout</code>. Entire request-response, including connection time. Wall clock time.
|-
|php-fpm
|201 seconds (appservers) / 201 seconds (api) / 201 seconds (parsoid) / 86400 seconds (jobrunner, videoscaler).
Configured by <code>profile::mediawiki::php::request_timeout</code>. Wall clock time.
|-
|PHP
|210 seconds (appserver, api, parsoid) / 1200 seconds (jobrunner, videoscaler).
Configured by <code>max_execution_time</code>. CPU time (not including syscalls and C functions from extensions).
|-
|MediaWiki
|60 seconds (GET) / 200 seconds (POST) / 200 seconds (jobrunner) / 86400 seconds (videoscaler).
This is configured using php-excimer.
|}
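
Because a request passes through every layer in turn, the limit that fires first is, roughly, the smallest one on its path. The Python sketch below works this out for a text-cluster request to an appserver, using the values listed on this page. It is a simplification under stated assumptions: it compares TTFB and whole-request limits as if they were the same kind of timer, it assumes the request reaches the appserver through Envoy rather than Nginx (TLS), and it leaves out PHP's <code>max_execution_time</code> because that counts CPU time. The dictionary layout and function name are illustrative only.

```python
from typing import Dict, Tuple

# Wall-clock limits in seconds, copied from the tables on this page
# (text cache cluster, appserver). Assumption: the request goes through Envoy.
UPSTREAM_LIMITS: Dict[str, int] = {
    "ats-tls (TTFB)": 180,
    "varnish-frontend (TTFB, text)": 65,
    "ats-backend (TTFB)": 180,
    "Envoy (route timeout)": 65,
    "Apache (Timeout)": 202,
    "php-fpm (request_timeout)": 201,
}

# MediaWiki's own limit depends on the HTTP method.
MEDIAWIKI_LIMITS: Dict[str, int] = {"GET": 60, "POST": 200}


def tightest_limit(method: str) -> Tuple[str, int]:
    """Return the (layer, seconds) pair with the smallest limit for this method.

    When two layers share the smallest value, the one listed first wins.
    """
    limits = dict(UPSTREAM_LIMITS)
    limits["MediaWiki (" + method + ")"] = MEDIAWIKI_LIMITS[method]
    return min(limits.items(), key=lambda item: item[1])


print(tightest_limit("GET"))
print(tightest_limit("POST"))
```

Under this model, a GET is bounded first by MediaWiki's 60-second limit, whereas a POST is bounded by the 65-second limits at varnish-frontend and Envoy, well below MediaWiki's 200 seconds.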

Notes

The app server timeouts might be larger than the ones on the caching layer; this is mainly to properly service internal clients.

; php-fpm
: The <code>request_timeout</code> setting is the maximum time php-fpm will spend processing a request before terminating the worker process. This exists as a last resort to kill PHP processes even if a long-running C function is not yielding to Excimer and/or if PHP raised <code>max_execution_time</code> at run-time.
; PHP
: The <code>max_execution_time</code> setting in php.ini measures CPU time (not wall clock time), and does not include syscalls.
: Note that unlike all other settings, for videoscalers this setting is far lower than the higher-level timeouts (20min vs 24h). This is a compromise to prevent regular jobs from being able to spend 24h on the CPU, which would be very unexpected (as they share the same php-fpm configuration). Videoscaling jobs are expected to spend most of their time transcoding videos, which happens through syscalls, so this is fine.
; MediaWiki
: This is controlled by the <code>ExcimerTimer</code> interval value, in wmf-config/set-time-limit. Upon reaching the timeout, php-excimer will throw a <code>WMFTimeoutException</code> exception once the current syscall returns.
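
The distinction between CPU time and wall clock time in the notes above can be illustrated with a short Python analogy. This is not PHP and does not reproduce how <code>max_execution_time</code> is measured (for instance, Python's <code>time.process_time()</code> also counts system CPU time); it only shows that a process waiting on something external accumulates wall clock time while using almost no CPU time. The helper name <code>measure</code> is hypothetical.

```python
import time
from typing import Callable, Tuple


def measure(work: Callable[[], None]) -> Tuple[float, float]:
    """Run work() and return (wall_clock_seconds, cpu_seconds) it consumed."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    work()
    wall_elapsed = time.perf_counter() - wall_start
    cpu_elapsed = time.process_time() - cpu_start
    return wall_elapsed, cpu_elapsed


# Sleeping stands in for waiting on an external process, such as a transcode.
wall, cpu = measure(lambda: time.sleep(0.3))
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")
```

A limit based on CPU time would barely notice the waiting period here, which is why a job that mostly waits can run far longer in wall clock terms than its CPU limit suggests.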