HTTP timeouts: Difference between revisions
imported>Vgutierrez m (remove mention to varnish-be) |
imported>Krinkle (→App server: Fix broken link to set-time-limit.php file) |
||
(4 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
This | {{Navigation Wikimedia infrastructure|expand=caching}} | ||
This documents '''HTTP timeouts''' involved in a web request from end-users to a service behind WMF traffic layers. | |||
{{TOC|limit=2|clear=none}} | |||
The entry point for | == {{Anchor|TLS}}Frontend TLS == | ||
The entry point for external clients is ats-tls. Which of the "cp" hosts is routed through, depends on the service and end-user IP address: | |||
{| class="wikitable" | {| class="wikitable" | ||
!TLS termination layer | !TLS termination layer | ||
! | !TLS handshake timeout | ||
!connect timeout (origin server) | !connect timeout (origin server) | ||
!TTFB (origin server) | !TTFB (origin server) | ||
!successive reads (origin server) | !successive reads (origin server) | ||
!Keepalive timeout (client) | !Keepalive timeout (client) | ||
|- | |- | ||
|ats-tls | |ats-tls | ||
Line 26: | Line 22: | ||
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/common/profile/trafficserver/tls.yaml#L140 120 seconds] | |[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/common/profile/trafficserver/tls.yaml#L140 120 seconds] | ||
|} | |} | ||
Currently a big difference between | Currently a big difference between ats-tls and nginx (used previously for frontend TLS) is in how they handle POST requests. nginx buffered the whole request completely before relaying it to the origin (varnish-frontend) while ats-tls doesn't buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout to fulfil the POST body was 60 seconds between read operations, this its default value and it isn't explicitly configured. | ||
== Caching == | |||
Our caching system is split in two layers (frontend, and backend). There is one implementation of the frontend layer (varnish) and one implementation of the backend layer (ats-be). | |||
{| class="wikitable" | {| class="wikitable" | ||
!caching layer | !caching layer | ||
!connect timeout | !connect timeout | ||
Line 37: | Line 35: | ||
|- | |- | ||
|varnish-frontend | |varnish-frontend | ||
|[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/text.yaml#L413 3 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L33 5 seconds]<sup>upload</sup> | |[https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/text.yaml#L413 3 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L33 5 seconds] <sup>(upload)</sup> | ||
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L414 65 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L34 35 seconds]<sup>upload</sup> | |[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L414 65 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L34 35 seconds] <sup>(upload)</sup> | ||
|[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L415 33 seconds]<sup>text</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L35 60 seconds]<sup>upload</sup> | |[https://github.com/wikimedia/puppet/blob/a9b571595f0e97fe335e81a0b03d31e284271ec8/hieradata/role/common/cache/text.yaml#L415 33 seconds] <sup>(text)</sup> / [https://github.com/wikimedia/puppet/blob/1410c8aa6043d002aaf32ca49cdc4bd4c3434927/hieradata/role/common/cache/upload.yaml#L35 60 seconds] <sup>(upload)</sup> | ||
|- | |- | ||
|ats-backend | |ats-backend | ||
Line 46: | Line 44: | ||
|[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/backend.yaml#L393 180 seconds] | |[https://github.com/wikimedia/puppet/blob/765d39f66320a4def7adccaa8a63fc970e278eb0/hieradata/common/profile/trafficserver/backend.yaml#L393 180 seconds] | ||
|} | |} | ||
== App server == | |||
{{see also|MediaWiki at WMF#Timeouts}} | |||
After leaving the backend caching layer, the request reaches the appserver. Here are described the timeouts that apply to appservers and api: | After leaving the backend caching layer, the request reaches the appserver. Here are described the timeouts that apply to appservers and api: | ||
{| class="wikitable" | {| class="wikitable" | ||
|+ | |+ As of March 2020 | ||
!layer | !layer | ||
!request timeout | !request timeout | ||
|- | |- | ||
| | |Envoy (TLS) | ||
| | |[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/envoyproxy/manifests/tls_terminator.pp#L68 1 second] <sup>(connect timeout)</sup> / [https://github.com/wikimedia/puppet/blob/production/modules/envoyproxy/manifests/tls_terminator.pp#L69 65 seconds] <sup>(route timeout)</sup> | ||
{{Outdated-inline}} | |||
|- | |- | ||
| | |Apache | ||
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/ | |[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/modules/mediawiki/templates/apache/apache2.conf.erb#L4 202 seconds] <sup>(appserver, api, parsoid)</sup> / 1202 seconds <sup>(jobrunner)</sup> / 86402 seconds <sup>(videoscaler)</sup>. | ||
Configured by <code>Timeout</code>. Entire request-response, including connection time. Wall clock time. | |||
|- | |- | ||
| | |php-fpm | ||
|[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/ | |[https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver.yaml#L32 201 seconds] <sup>(appservers)</sup> / [https://github.com/wikimedia/puppet/blob/bbc63d02c260e953f71dfd6535a0a67c4ad944a7/hieradata/role/common/mediawiki/appserver/api.yaml#L32 201 seconds] <sup>(api)</sup> / 201 seconds <sup>(parsoid)</sup> / 86400 seconds <sup>(jobrunner, videoscaler)</sup>. | ||
Configured by <code>profile::mediawiki::php::request_timeout</code>. Wall clock time. | |||
|- | |- | ||
|PHP | |PHP | ||
| | |210 seconds <sup>(appserver, api, parsoid)</sup> / 1200 seconds <sup>(jobrunner,</sup> <sup>videoscaler)</sup>. | ||
Configured by <code>max_execution_time</code>. CPU time (not including syscalls and C functions from extensions). | |||
|- | |- | ||
| | |MediaWiki | ||
| | |60 seconds <sup>(GET)</sup> / 200 seconds <sup>(POST)</sup> / 1200 seconds <sup>(jobrunner)</sup> / 86400 seconds <sup>(videoscaler)</sup>. | ||
This is configured [https://github.com/wikimedia/operations-mediawiki-config/blob/9d7f0b70266549bdbdf02838948b7e6bc44d468e/wmf-config/CommonSettings.php#L428-L455 using php-excimer] | |||
|} | |} | ||
=== Notes === | |||
The app server timeouts might be larger than the ones on the caching layer, this is mainly to properly service internal clients. | |||
; php-fpm | |||
: The <code>request_timeout</code> setting the maximum time php-fpm will spend processing a request before terminating the worker process. This exists as a last-resort to kill PHP processes even if a long-running C function is not yielding to Excimer and/or if PHP raised <code>max_execution_time</code> at run-time. | |||
; PHP | |||
: The <code>max_execution_time</code> setting in php.ini measures CPU time (not wall clock time), and does not include syscalls. | |||
:Note that this is intentionally several seconds higher than the layers above and below because we generally want to avoid requests being stopped by this layer and prefer it to happen either earlier in MW or higher up in php-fpm. | |||
:This layer is not able to differentiate between HTTP methods (GET/POST) or virtual hostnames (jobrunner vs videoscaler). As such, it has to accomodate both. | |||
: For videoscalers this setting is actually lower than the surrounding layers (1200s/20min vs 86400s/24h). This is a compromise to prevent non-videoscaler jobs from being able to spend 24h on the CPU, which would be very unexpected. Regular jobrunners and videoscalers are forced to share the same php-fpm configuration. This is fine because while videoscaling jobs may use 24h to complete, they are expected to spend most of their time transcoding videos, which happens through syscalls that are not captured by PHP's cpu time. | |||
; MediaWiki | |||
: This is controlled by the <code>ExcimerTimer</code> interval value, in [https://github.com/wikimedia/operations-mediawiki-config/blob/HEAD/wmf-config/set-time-limit.php#L14 wmf-config/set-time-limit]. Upon reaching the timeout, [[mw:Excimer|php-excimer]] will throw a <code>WMFTimeoutException</code> exception once the current syscall returns. |
Latest revision as of 22:31, 22 April 2022
This documents HTTP timeouts involved in a web request from end-users to a service behind WMF traffic layers.
Frontend TLS
The entry point for external clients is ats-tls. Which "cp" host a request is routed through depends on the service and the end-user IP address:
TLS termination layer | TLS handshake timeout | connect timeout (origin server) | TTFB (origin server) | successive reads (origin server) | Keepalive timeout (client) |
---|---|---|---|---|---|
ats-tls | 60 seconds | 3 seconds | 180 seconds | 180 seconds | 120 seconds |
Currently, a big difference between ats-tls and nginx (used previously for frontend TLS) is in how they handle POST requests. nginx buffered the whole request before relaying it to the origin (varnish-frontend), while ats-tls does not buffer it and relays the connection to varnish-frontend as soon as possible. On nginx, the timeout to receive the POST body was 60 seconds between read operations; this is its default value and it was not explicitly configured.
Caching
Our caching system is split into two layers (frontend and backend). There is one implementation of the frontend layer (varnish) and one implementation of the backend layer (ats-be).
caching layer | connect timeout | TTFB | successive reads |
---|---|---|---|
varnish-frontend | 3 seconds (text) / 5 seconds (upload) | 65 seconds (text) / 35 seconds (upload) | 33 seconds (text) / 60 seconds (upload) |
ats-backend | 10 seconds | 180 seconds | 180 seconds |
App server
After leaving the backend caching layer, the request reaches the appserver. The following timeouts apply to appservers and api:
As of March 2020:

layer | request timeout |
---|---|
Envoy (TLS) | 1 second (connect timeout) / 65 seconds (route timeout) (marked as outdated) |
Apache | 202 seconds (appserver, api, parsoid) / 1202 seconds (jobrunner) / 86402 seconds (videoscaler). Configured by the Timeout directive. Entire request-response, including connection time. Wall clock time. |
php-fpm | 201 seconds (appservers) / 201 seconds (api) / 201 seconds (parsoid) / 86400 seconds (jobrunner, videoscaler). Configured by profile::mediawiki::php::request_timeout. Wall clock time. |
PHP | 210 seconds (appserver, api, parsoid) / 1200 seconds (jobrunner, videoscaler). Configured by max_execution_time. CPU time (not including syscalls and C functions from extensions). |
MediaWiki | 60 seconds (GET) / 200 seconds (POST) / 1200 seconds (jobrunner) / 86400 seconds (videoscaler). This is configured using php-excimer. |
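To get a feel for which limit a slow request on the text cluster is likely to hit first, the wall-clock values from the tables on this page can be lined up and compared. The following is only a rough sketch: the function names are hypothetical, the "smallest limit fires first" rule is a simplifying assumption (the layers start their timers at slightly different moments), and the PHP max_execution_time value is left out because it measures CPU time rather than wall-clock time.

```python
# Wall-clock limits in seconds, copied from the tables on this page.
# Assumption: a request passes through every layer in this order and the
# smallest limit is the one that fires first.
def text_chain(mediawiki_limit):
    """Build the list of (layer, seconds) pairs for a text-cluster request."""
    return [
        ("ats-tls TTFB", 180),
        ("varnish-frontend TTFB (text)", 65),
        ("ats-backend TTFB", 180),
        ("Envoy route timeout", 65),
        ("Apache Timeout (appserver)", 202),
        ("php-fpm request_timeout (appserver)", 201),
        ("MediaWiki (php-excimer)", mediawiki_limit),
    ]


def first_to_fire(chain):
    """Return the (layer, seconds) pair with the smallest limit.

    On a tie, min() keeps the entry that appears first in the chain.
    """
    return min(chain, key=lambda item: item[1])


get_layer, get_seconds = first_to_fire(text_chain(60))    # GET limit
post_layer, post_seconds = first_to_fire(text_chain(200))  # POST limit

print(f"GET:  {get_layer} after {get_seconds}s")
print(f"POST: {post_layer} after {post_seconds}s")
```

Under these assumptions, a GET is stopped by MediaWiki itself at 60 seconds, while a POST arriving through the caching layers reaches the 65-second varnish-frontend limit long before MediaWiki's 200 seconds. Only internal clients that bypass the caching layers can make use of the larger app server limits.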
Notes
The app server timeouts may be larger than those on the caching layer; this is mainly to properly serve internal clients.
- php-fpm
  - The request_timeout setting is the maximum time php-fpm will spend processing a request before terminating the worker process. This exists as a last resort to kill PHP processes even if a long-running C function is not yielding to Excimer and/or if PHP raised max_execution_time at run-time.
- PHP
  - The max_execution_time setting in php.ini measures CPU time (not wall clock time), and does not include syscalls.
  - Note that this is intentionally several seconds higher than the layers above and below because we generally want to avoid requests being stopped by this layer and prefer it to happen either earlier in MW or higher up in php-fpm.
  - This layer is not able to differentiate between HTTP methods (GET/POST) or virtual hostnames (jobrunner vs videoscaler). As such, it has to accommodate both.
  - For videoscalers this setting is actually lower than the surrounding layers (1200s/20min vs 86400s/24h). This is a compromise to prevent non-videoscaler jobs from being able to spend 24h on the CPU, which would be very unexpected. Regular jobrunners and videoscalers are forced to share the same php-fpm configuration. This is fine because while videoscaling jobs may use 24h to complete, they are expected to spend most of their time transcoding videos, which happens through syscalls that are not captured by PHP's CPU time.
- MediaWiki
  - This is controlled by the ExcimerTimer interval value, in wmf-config/set-time-limit. Upon reaching the timeout, php-excimer will throw a WMFTimeoutException once the current syscall returns.
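The distinction between CPU time and wall clock time in the PHP note above can be illustrated with a small experiment. This is a Python analogy, not PHP's actual mechanism: the helper names are hypothetical, and Python's time.process_time() counts user plus system CPU time, so it only approximates what max_execution_time measures. The point it shows is that time spent blocked (here, sleeping) advances the wall clock but barely advances the CPU clock.

```python
import time


def measure(fn):
    """Run fn and return (wall_seconds, cpu_seconds) consumed by the call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu


def wait_blocked():
    # Blocked in the kernel: the process is not running on a CPU meanwhile.
    time.sleep(0.2)


def burn_cpu():
    # Busy loop: the process is on the CPU for most of the elapsed time.
    total = 0
    for i in range(2_000_000):
        total += i * i
    return total


sleep_wall, sleep_cpu = measure(wait_blocked)
busy_wall, busy_cpu = measure(burn_cpu)

print(f"sleeping: wall={sleep_wall:.3f}s cpu={sleep_cpu:.3f}s")
print(f"busy:     wall={busy_wall:.3f}s cpu={busy_cpu:.3f}s")
```

A videoscaling job behaves like the sleeping case from PHP's point of view: most of its 24 hours is spent waiting on transcoding work that PHP's CPU timer does not capture, which is why a 1200-second CPU limit does not cut it short.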