SLO/Varnish

Status: '''approved'''

== Organizational ==

=== Service ===

This service comprises the Varnish daemons used as frontend caches in WMF's production CDN edge infrastructure. It covers only the main varnish daemon (/usr/sbin/varnishd) in its capacity to accurately and functionally serve production HTTP requests in real time; it does not cover the various ancillary tools, binaries, or statistics/logging mechanisms. All production instances in all datacenters are covered, which means all of the hardware machines currently named cp[0-9]*.{site}.wmnet.
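As an illustration of that scope, a covered-host check might look like the following minimal Python sketch. The regular expression simply encodes the naming scheme above across the five production datacenters named later in this page; the example hostnames are hypothetical.

<syntaxhighlight lang="python">
import re

# Hosts covered by this SLO, per the naming scheme above:
# cp[0-9]*.{site}.wmnet across the five production datacenters.
COVERED = re.compile(r"^cp\d+\.(eqiad|codfw|esams|ulsfo|eqsin)\.wmnet$")

def is_covered(fqdn: str) -> bool:
    """Return True if the host falls under this SLO."""
    return COVERED.match(fqdn) is not None

# Example hostnames (hypothetical):
assert is_covered("cp3050.esams.wmnet")
assert not is_covered("lvs1001.eqiad.wmnet")
</syntaxhighlight>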

=== Teams ===

SRE/Traffic owns and operates this service layer, and is also responsible for all of the direct dependencies at both edges of this service: the L4LB and/or TLS termination layers in the outwards-facing direction, and the ATS backend caches in the inwards-facing direction. Therefore, while this service impacts many other teams and services, responsibility for it clearly rests with a single team. Other SRE subteams additionally share the burden of on-call incident response for this service.

== Architectural ==

=== Environmental dependencies ===

This service runs independently in all 5 datacenters and comprises two distinct clusters, named text and upload. In any given DC, text and upload have identical hardware configurations and layouts. We can (and probably should) define SLOs both for the global aggregate of a cluster and per-DC per-cluster; the global aggregate may also need to discount manually-depooled time windows for specific datacenters, as in the discussion of L4LB depools within a DC. The physical characteristics differ per-DC as follows (these numbers are for a single cluster, text or upload); a sketch of the aggregation question follows the table.

{| class="wikitable"
! Datacenter !! DC Layout !! Cluster Machines !! Cluster Layout
|-
| eqiad || 4 rows, multiple racks each || 8 || 2 machines per row, each in a distinct rack
|-
| codfw || 4 rows, multiple racks each || 8 || 2 machines per row, each in a distinct rack
|-
| esams || 1 row, 3 racks || 8 || Text 3:3:2, Upload 2:3:3 (machines in each of the 3 racks)
|-
| ulsfo || 1 row, 2 racks || 6 (upgrade to 8 coming in FY21-22) || 3:3
|-
| eqsin || 1 row, 2 racks || 8 || 4:4
|}
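The aggregation question raised above can be made concrete with a small sketch. Assuming per-window good/total request counts and a depooled flag (all field names are hypothetical), the global and per-DC per-cluster SLI computations differ only in their filters:

<syntaxhighlight lang="python">
from dataclasses import dataclass

@dataclass
class Window:
    dc: str            # e.g. "eqiad"
    cluster: str       # "text" or "upload"
    good: int          # requests meeting the SLI defined below
    total: int         # all requests in this time window
    depooled: bool     # DC manually depooled during this window

def sli(windows, dc=None, cluster=None):
    """SLI over a set of windows; pass dc/cluster to narrow the scope."""
    good = total = 0
    for w in windows:
        if w.depooled:
            continue  # discount manually-depooled windows, per the open question above
        if dc and w.dc != dc:
            continue
        if cluster and w.cluster != cluster:
            continue
        good += w.good
        total += w.total
    return good / total if total else 1.0
</syntaxhighlight>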

=== Service dependencies ===

Hard dependencies: none, aside from standard infrastructure such as server hardware, the network, and infra-layer software.

We have a soft dependency on all the public-facing services we handle edge traffic for (e.g. the MediaWiki "appservers" cluster, the "api" cluster, RESTBase, and so on); we can offer limited functionality without any of them (cache hits for read-only content, internal-to-varnish redirect outputs, possibly varnish-generated "503" output pages), but full service in the way users expect is impossible without the underlying service(s). A sketch of this degraded mode follows.
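Purely as an illustration of that degraded mode (a toy model, not actual VCL), what the edge can still answer with every backend down might look like this; all names and the redirect mapping are hypothetical:

<syntaxhighlight lang="python">
# Toy model of degraded service with every backend down; names hypothetical.
SYNTHETIC_REDIRECTS = {"/": "https://www.wikimedia.org/"}  # illustrative only

def degraded_response(path: str, cache: dict):
    if path in cache:                        # cache hit for read-only content
        return 200, cache[path]
    if path in SYNTHETIC_REDIRECTS:          # internal-to-varnish redirect
        return 301, SYNTHETIC_REDIRECTS[path]
    return 503, "Backend fetch failed"       # varnish-generated error page
</syntaxhighlight>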

== Client-facing ==

=== Clients ===

ATS-TLS: all important traffic reaches varnish-fe through the ats-tls daemon on the same machine, so ats-tls is the single definitional client (the L4LB, and then real humans, are clients-of-clients). The public users of virtually all public services are indirect clients who also depend on the Varnish layer functioning in order to get any service at all.

== Service Level Indicators (SLIs) ==

The fraction of requests spending less than 0.1 second of processing time inside Varnish itself, without a Varnish internal error. A Varnish internal error is a 5xx generated by Varnish itself (as opposed to by an underlying backend service), excluding 503 Fetch Errors (failure to get a reply status from the backend service). A per-request sketch of this classification follows.
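The definition above, expressed as a minimal Python sketch: the parameter names are hypothetical, standing in for values that would in practice come out of Varnish's shared-memory log (e.g. via varnishlog or varnishncsa).

<syntaxhighlight lang="python">
def is_good(duration_s: float, status: int, generated_by_varnish: bool,
            fetch_error: bool) -> bool:
    """Does a single request count toward the SLI?"""
    internal_error = (
        generated_by_varnish                      # 5xx produced by Varnish itself...
        and status >= 500
        and not (status == 503 and fetch_error)   # ...excluding 503 Fetch Error
    )
    return duration_s < 0.1 and not internal_error
</syntaxhighlight>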

== Operational ==

=== Monitoring ===

The service consists of a supervisor varnishd process responsible for starting a child process. The child handles actual traffic; the supervisor handles administrative commands and restarts the child if it stops responding. There is an Icinga check ensuring that the child responds to HTTP requests, as well as an additional check called '''varnish-frontend-check-child-start''' which raises a critical if the supervisor process has had to restart its child since it began operating.
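One plausible shape for that child-restart check (the actual implementation may differ) compares the uptime of the management process with that of the child, using the MGT.uptime and MAIN.uptime counters exposed by varnishstat:

<syntaxhighlight lang="python">
import json
import subprocess

def child_was_restarted(slack_s: int = 5) -> bool:
    """If the supervisor has been up much longer than the child,
    the child must have been restarted at some point."""
    out = subprocess.check_output(
        ["varnishstat", "-j", "-f", "MGT.uptime", "-f", "MAIN.uptime"])
    stats = json.loads(out)
    counters = stats.get("counters", stats)  # JSON layout varies by Varnish version
    mgt = counters["MGT.uptime"]["value"]    # supervisor uptime, seconds
    child = counters["MAIN.uptime"]["value"] # child uptime, seconds
    return mgt - child > slack_s
</syntaxhighlight>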

=== Troubleshooting ===

In most cases of anomalous operation it is sufficient to restart the service and open a task for the Traffic team with a description of the symptoms, as well as the varnishd crash log found in the systemd journal (if any); a helper for collecting that log is sketched below.
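A hypothetical helper for that workflow, assuming the systemd unit is named varnish-frontend.service (an assumption; adjust to the local unit name):

<syntaxhighlight lang="python">
import subprocess

def collect_crash_log(unit: str = "varnish-frontend.service",
                      since: str = "-1h") -> str:
    """Grab the recent varnishd journal (including any crash/panic
    output) to attach to the Traffic task. Unit name is an assumption."""
    return subprocess.check_output(
        ["journalctl", "-u", unit, "--since", since, "--no-pager"],
        text=True)
</syntaxhighlight>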

=== Deployment ===

The service is deployed by Puppet using the [[gerrit:plugins/gitiles/operations/puppet/+/refs/heads/production/modules/varnish/manifests/|varnish puppet module]].

== Service Level Objectives ==

The agreed-upon SLO is that 99.9% of requests spend less than 0.1 second of processing time inside Varnish itself, without a Varnish internal error.
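As a back-of-the-envelope check of what 99.9% means in practice, with a purely illustrative request volume (not a measured figure):

<syntaxhighlight lang="python">
# Error budget for the 99.9% target: at most 0.1% of requests may be
# slow (>= 0.1 s inside Varnish) or fail with a Varnish internal error.
requests_per_quarter = 5_000_000_000   # illustrative figure only
slo = 0.999
error_budget = requests_per_quarter * (1 - slo)
print(f"{error_budget:,.0f} bad requests allowed per quarter")
# -> 5,000,000 bad requests allowed per quarter
</syntaxhighlight>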