SLO/Test Kitchen

From Wikitech

Status: approved

Organizational

The Experiment Platform team is building a set of experimentation and instrumentation tools collectively known as Test Kitchen, covering both standardized and custom instruments and experiments. Test Kitchen enables controlled experimentation (A/B tests) across all Wikimedia properties. It provides standardized tools and processes to reduce experiment setup time from 10 weeks to 1 week, enable cross-wiki testing, support testing with both logged-in and logged-out users, automate data collection and analysis, ensure compliance with privacy policies, and eventually make experiment results publicly accessible.

Service

The service is made up of components running across our infrastructure, collectively referred to as Test Kitchen.

Teams

The Experiment Platform team is responsible for Test Kitchen.

Architectural

Environmental dependencies

Test Kitchen (formerly Experimentation Lab) consists of:

  • Test Kitchen UI (formerly known as MPIC or xLab): A standalone configuration UI with a configuration API, running on Node.js
  • Test Kitchen SDKs:
    • PHP: Running on the MediaWiki application servers
    • JS: Sent to visitors' browsers via ResourceLoader
    • (future) Swift and Kotlin running in the native iOS and Android apps
  • Varnish vmod logic that manages Edge Unique cookies
  • EventGate customizations to handle data sent by Test Kitchen client libraries
  • Data pipelines coordinating the flow of instrumentation data from Kafka to HDFS and ultimately Superset dashboards, using Airflow
  • A Beta Cluster deployment that simulates the above as much as possible

Service dependencies

Hard Dependencies

  • Varnish - without it, users won't be enrolled in some experiments, so we would consider the system to be down
  • EventGate - receives requests to log instrumentation data; without it, experiments cost us in user experience without yielding any data
  • Kafka "jumbo" (see link to specific EventGate cluster we use and its dependency on Kafka "jumbo")
  • MediaWiki core and extensions (these react to experiments and let MediaWiki developers know which UI to display):
    • ResourceLoader
    • WikimediaEvents
    • MetricsPlatform
      • EventLogging
      • EventStreamConfig

Soft Dependencies

  • Data Platform
  • Wikimedia Wikis
    • Beta cluster
    • Production


Client-facing

User documentation: Test Kitchen

Clients

  • Experiment Managers: configure instrumentation in the Test Kitchen UI
  • Instrument Developers: program against the API provided by the Test Kitchen SDKs
  • Product Analysts: work with data collected and transformed in the Data Platform
  • Varnish Experiment Configuration poller: regularly pulls configuration from the Test Kitchen API to allow Varnish to act as an Experiment Enrollment Authority


Request Classes

1. Experiment Configuration: Creating / Reading / Updating / Deleting experiment configurations
2. Data Collection
  • Sending instrumentation events through EventGate
  • Processing user interaction data

Service Level Indicators (SLIs)

Request Class 1: Experiment Configuration

Combined latency-availability SLI: The percentage of all application requests that complete within 1 second (1000 milliseconds) and receive a non-error response, defined as an HTTP status code other than 5XX. We would normally also treat HTTP 4XX responses as problematic, but the service is exposed to the public internet, so we expect an unknown volume of 4XX responses from traffic we don't control. Nevertheless, we commit to finding a way to monitor client 4XX problems in the future, and to addressing any that we find.
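
The definition above can be sketched in a few lines of Python. This is only an illustration of the classification rule, not the production recording rule: the Request type, the function names, and the choice to report 100% when there is no traffic are all assumptions made for the example.

```python
from dataclasses import dataclass

LATENCY_THRESHOLD_MS = 1000  # "within 1 second" from the SLI definition


@dataclass(frozen=True)
class Request:
    """Hypothetical record of one application request."""
    status: int        # HTTP status code of the response
    latency_ms: float  # time taken to complete the request, in milliseconds


def is_good(req: Request) -> bool:
    """Good means: not a 5XX response and completed within the threshold.

    4XX responses count as good here, matching the SLI definition.
    """
    is_server_error = 500 <= req.status <= 599
    return not is_server_error and req.latency_ms <= LATENCY_THRESHOLD_MS


def combined_sli(requests: list[Request]) -> float:
    """Percentage of requests that are good.

    Assumption: an empty window is reported as 100.0.
    """
    if not requests:
        return 100.0
    good = sum(1 for r in requests if is_good(r))
    return 100.0 * good / len(requests)
```

For example, a window containing a fast 200, a fast 404, a fast 503, and a 200 that took 1.5 seconds has two good requests out of four, so the SLI for that window is 50%.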

Request Class 2: Data Collection

Availability SLI: Let R be the number of requests to log experiment data via EventGate. R breaks down as:

R = S + Es + Ei + L

where:

  • S: requests that successfully produce to the intended Kafka topic
  • Es: requests that produce to the Error topic with a system-related error (invalid header, etc.)
  • Ei: requests that produce to the Error topic because of invalid instrumentation data (schema validation problems). Note: noise generated accidentally or purposefully would also fall in this category. We are working to minimize noise, but do not commit to that as part of SLOs defined around this SLI.
  • L: requests that are lost on the network from the Test Kitchen SDKs to EventGate

We then have two SLIs: S / (S + Es) and S / (S + Ei)

(We hope to track L as well in the future; this is currently not feasible.)
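
As a minimal sketch of the two ratios, assuming the counts S, Es, and Ei for a window are already available as integers (the function name and the handling of an empty window are assumptions for illustration, and L is left out because it is not yet measurable):

```python
def data_collection_slis(s: int, es: int, ei: int) -> tuple[float, float]:
    """Return (S / (S + Es), S / (S + Ei)) for one window of counts.

    s:  requests successfully produced to the intended Kafka topic
    es: requests sent to the Error topic with a system-related error
    ei: requests sent to the Error topic because of invalid instrumentation data

    Assumption: a ratio with a zero denominator is reported as 1.0.
    """
    system_sli = s / (s + es) if (s + es) > 0 else 1.0
    instrumentation_sli = s / (s + ei) if (s + ei) > 0 else 1.0
    return system_sli, instrumentation_sli
```

Note that each ratio ignores the other error category: a burst of schema validation errors lowers S / (S + Ei) but leaves S / (S + Es) unchanged, which keeps system problems and instrumentation problems separately visible.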

Operational

Monitoring

Request Class 1

Request Class 2

Request Class 3

N/A

Troubleshooting

See Test Kitchen/Troubleshooting.

Deployment

The EventGate and Test Kitchen UI services are deployed via Kubernetes. Details of the Test Kitchen UI deployment and deployment instructions can be found at Test Kitchen/Test Kitchen UI/Administration.

The MetricsPlatform MediaWiki extension is deployed via the MediaWiki deployment pipeline, which is maintained by Release Engineering.

Service Level Objectives

Request Class 1: Experiment Configuration

Over a 90-day rolling window, 95% of application requests to Test Kitchen have an HTTP status code other than 5XX and a latency of 1 second or less.
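
To make the objective concrete, the sketch below converts the 95% target into an error budget for a given request volume. Only the 95% target comes from the objective; the function names and the example volumes are hypothetical, and exact fractions are used simply to avoid floating-point rounding at the boundary.

```python
import math
from fractions import Fraction

SLO_TARGET = Fraction(95, 100)  # 95% target from the objective above


def allowed_bad_requests(total_requests: int, target: Fraction = SLO_TARGET) -> int:
    """Largest number of bad requests (5XX or slower than 1 second)
    that still meets the target for the given request volume."""
    return math.floor(total_requests * (1 - target))


def budget_consumed(total_requests: int, bad_requests: int,
                    target: Fraction = SLO_TARGET) -> float:
    """Share of the error budget used so far.

    Values above 1.0 mean the objective is missed for this window.
    """
    budget = total_requests * (1 - target)
    if budget == 0:
        return 0.0 if bad_requests == 0 else float("inf")
    return float(bad_requests / budget)
```

For instance, with a hypothetical 1,000,000 requests in the 90-day window, up to 50,000 may be bad; 20,000 bad requests would consume 40% of the budget.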

Request Class 2: Data Collection

Over a 90-day rolling window,

S / (S + Es) >= 99.9%

S / (S + Ei) >= 95%
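
A compliance check against these two targets might look like the following sketch. It assumes the window's counts of S, Es, and Ei are already known; the function names, the dictionary keys, and the treatment of an empty window as compliant are assumptions for illustration only.

```python
SYSTEM_TARGET = 0.999          # S / (S + Es) >= 99.9%
INSTRUMENTATION_TARGET = 0.95  # S / (S + Ei) >= 95%


def success_ratio(successes: int, errors: int) -> float:
    """successes / (successes + errors); 1.0 when there were no requests."""
    total = successes + errors
    return successes / total if total > 0 else 1.0


def check_data_collection_slos(s: int, es: int, ei: int) -> dict[str, bool]:
    """Report, per objective, whether the window's counts meet the target."""
    return {
        "system": success_ratio(s, es) >= SYSTEM_TARGET,
        "instrumentation": success_ratio(s, ei) >= INSTRUMENTATION_TARGET,
    }
```

With hypothetical counts S = 100,000, Es = 50, and Ei = 8,000, the first objective is met (about 99.95%) while the second is not (about 92.6%), so the two objectives can pass or fail independently.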

Rules and tracking

For the original implementation, see data_platform.pp and recording_rules.yaml (the implementation may be adjusted over time).

See https://slo.wikimedia.org/?search=xlab for SLO tracking.

Note that, in the configuration, the burn-rate alerting rules (to be enabled in mid-November 2025) are expressed using a 4-week window, in keeping with current convention.
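
For readers unfamiliar with burn rate, the sketch below shows the commonly used definition: the observed error ratio divided by the error ratio the target allows. It does not reproduce the rules in recording_rules.yaml; the function name and the example counts are hypothetical.

```python
def burn_rate(good: int, bad: int, target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - target).

    A value of 1.0 means the error budget would be spent exactly by the end
    of the SLO window if the current error ratio persisted; 2.0 means it
    would be spent in half the window.
    Assumption: a period with no requests burns no budget.
    """
    total = good + bad
    if total == 0:
        return 0.0
    return (bad / total) / (1 - target)
```

For example, against the 99.9% target, a period with 9,980 good and 20 bad requests has an error ratio of 0.2%, which is a burn rate of about 2.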