
SLO/HAProxy

Revision as of 16:53, 9 May 2022 by imported>Vgutierrez (→‎Service Level Indicators (SLIs): Initial content)

Status: draft

Organizational

Service

This service is the set of HAProxy daemons used as TLS terminators in WMF’s production CDN edge infrastructure. It covers only the main haproxy daemon (/usr/sbin/haproxy) in its capacity to accurately and functionally serve production HTTP requests in real time; it does not cover the various ancillary tools, binaries, or statistics/logging mechanisms. All production instances in all datacenters are covered, that is, all of the hardware machines currently named cp[0-9]*.{site}.wmnet.

Teams

SRE/Traffic owns and operates this service layer. The same team is also responsible for the direct dependencies at both edges of this service: the L4LB in the outward-facing direction, and the Varnish frontend caches in the inward-facing direction. Therefore, while this service impacts many other teams and services, responsibility for it rests fairly clearly with a single team. Other SRE subteams share the burden of on-call incident response for this service.

Architectural

Instructions

Environmental dependencies

This service runs independently in all 6 datacenters, and in each of them it comprises two different clusters named text and upload. In any given DC, text and upload have identical hardware configurations and layouts. We can (should?) define SLOs both for the global aggregate of a cluster (which may also need to discount manually-depooled time windows for specific datacenters, as with the discussion above about L4LB depools within a DC?) and per-DC per-cluster. The physical characteristics differ per DC as follows (these numbers are for a single cluster, text or upload):

Datacenter  DC Layout                    Cluster Machines  Cluster Layout
eqiad       4 rows, multiple racks each  8                 2 machines per row, each in a distinct rack
codfw       4 rows, multiple racks each  8                 2 machines per row, each in a distinct rack
esams       1 row, 3 racks               8                 Text - 3:3:2, Upload - 2:3:3 (machines in each of the 3 racks)
ulsfo       1 row, 2 racks               8                 4:4
eqsin       1 row, 2 racks               8                 4:4
drmrs       1 row, 2 racks               8                 4:4
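
The difference between a global aggregate and a per-DC view, and the effect of discounting manually-depooled windows, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the tuple layout (datacenter, time window, good requests, total requests), and the sample numbers are hypothetical and do not reflect how WMF actually stores or computes these metrics.

```python
from collections import defaultdict

def per_dc_sli(samples, depooled=frozenset()):
    """Percentage of good requests per datacenter.

    samples:  iterable of (dc, window, good, total) tuples (hypothetical layout)
    depooled: set of (dc, window) pairs to exclude from the calculation
    """
    totals = defaultdict(lambda: [0, 0])
    for dc, window, good, total in samples:
        if (dc, window) in depooled:
            continue  # manually depooled: do not count against the SLI
        totals[dc][0] += good
        totals[dc][1] += total
    return {dc: 100.0 * g / t for dc, (g, t) in totals.items() if t}

def global_sli(samples, depooled=frozenset()):
    """Percentage of good requests across all datacenters combined."""
    good_sum = total_sum = 0
    for dc, window, good, total in samples:
        if (dc, window) in depooled:
            continue
        good_sum += good
        total_sum += total
    return 100.0 * good_sum / total_sum if total_sum else None

# Made-up numbers for one cluster: two time windows in two datacenters.
samples = [
    ("eqiad", 0, 990, 1000),
    ("eqiad", 1, 1000, 1000),
    ("esams", 0, 400, 500),
    ("esams", 1, 100, 500),  # assume esams was manually depooled in window 1
]
depooled = {("esams", 1)}

print(per_dc_sli(samples, depooled))
print(global_sli(samples, depooled))
print(global_sli(samples))  # same data without discounting the depool
```

Note that a global aggregate weighted by request count is dominated by the busiest datacenters, which is one reason a per-DC per-cluster view may be worth defining alongside it.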

Service dependencies

Hard dependencies: Varnish frontend caches. If Varnish is not reachable, HAProxy can only return error messages.

Client-facing

Instructions

Clients

Strictly speaking, the L4LB layer is the sole client of HAProxy. Considering L7 clients, however, every human and bot trying to reach Wikipedia or any other service hosted behind WMF's CDN acts as a client of HAProxy.

Service Level Indicators (SLIs)

Proxy combined latency-availability SLI: The percentage of all requests where the latency contributed by the proxy, excluding backend wait time, is within TBD milliseconds (excluding connection establishment and TLS handshake) and the request receives a non-error response or an error response originating at the backend. Errors originating at the proxy are measured as sessions with a termination state set to:

  • I (Internal Error)
  • R (Resource exhausted)
  • D (Session killed by HAProxy because the server was detected as down)
  • U (Session killed by HAProxy on this backup server because an active server was detected as up)
  • K (Session actively killed by an admin operating on HAProxy)
  • L (Session was locally processed by HAProxy and was not passed to a server)

More details about termination states can be found in the HAProxy configuration manual, in the section on session state at disconnection.
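
As a rough illustration of the SLI defined above, the following sketch classifies each request as good or bad and computes the resulting percentage. It is not the actual measurement pipeline: the Request type, its field names, and the sample data are hypothetical, and because the latency threshold is still TBD, the 50 ms used in the example is a placeholder rather than a proposed target. The sketch assumes the first character of the logged termination state carries the codes listed above.

```python
from dataclasses import dataclass

# First-character termination state codes that this SLI counts as
# errors originating at the proxy (taken from the list above).
PROXY_ERROR_STATES = frozenset("IRDUKL")

@dataclass
class Request:
    termination_state: str   # as logged by HAProxy, e.g. "----" or "SH--"
    proxy_latency_ms: float  # hypothetical field: latency added by the proxy,
                             # excluding backend wait, connect and TLS handshake

def is_good(req: Request, latency_threshold_ms: float) -> bool:
    """A request is good if the proxy did not cause an error and was fast enough."""
    proxy_error = req.termination_state[:1] in PROXY_ERROR_STATES
    return not proxy_error and req.proxy_latency_ms <= latency_threshold_ms

def sli_percentage(requests, latency_threshold_ms: float):
    """Percentage of good requests, or None when there is no traffic."""
    requests = list(requests)
    if not requests:
        return None
    good = sum(is_good(r, latency_threshold_ms) for r in requests)
    return 100.0 * good / len(requests)

# Made-up sample: the threshold of 50 ms is a placeholder, not a target.
sample = [
    Request("----", 3.0),   # normal completion, fast: good
    Request("SH--", 2.0),   # error originating at the backend: still good
    Request("I---", 1.0),   # internal proxy error: bad
    Request("----", 80.0),  # no error, but too slow: bad
]
print(sli_percentage(sample, latency_threshold_ms=50.0))
```

Backend-originated failures (for example a first character of S) deliberately count as good here, since the SLI only attributes errors to the proxy when the termination state is one of the listed codes.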

Operational

Instructions

Monitoring

How is the service monitored?

Troubleshooting

How complex is the service to troubleshoot?

Deployment

How is the service deployed?

Service Level Objectives

Instructions

Realistic targets

What are the realistic targets for each SLI? Why?

Ideal targets

What are the ideal targets for each SLI? Why?

Reconciliation

Reconcile the realistic vs. ideal targets, documenting any decisions made along the way.

Once the SLO is final, consider collapsing the above three sections.

What are the agreed-upon SLOs, for each SLI and each request class?