Test Kitchen/Decision Records/Changing Sampling During Experiment

From Wikitech

Context and Problem Statement

An important question is: Can I add wikis or increase the percentage of traffic that is in-sample for an experiment while the experiment is underway?

Decision Outcome

While it is possible, we recommend against changing the traffic allocation configuration during an experiment. An experiment helps to answer the question "should we roll out this feature for 100% of all traffic on all wikis?" by connecting a hypothesis to experimental results. A good sampling configuration optimizes the experiment's statistical power and the generalizability of its insights, while minimizing disruption to the user experience and guarding against overloading our analytics system with data. While there are reasons to tweak this configuration as the experiment unfolds, we currently advise against it, mainly because of current limitations of our automated analytics.

This is best explained with an example configuration and what would happen if it changed during the experiment. Suppose the experiment's duration is Friday June 20th through Friday June 27th:

                            English Wikipedia   Estonian Wikipedia   French Wikisource
Initial Config              0.1%                1%                   (not included)
Config Change Monday 23rd   0.1%                5%                   5%

Some observations about this scenario:

  • French Wikisource won't have any weekend data, and the feature may have a very different impact on weekends, when traffic patterns shift (e.g. the mobile/desktop mix)
  • Estonian Wikipedia has less weekend data than weekday data, because it was sampled at 1% over the weekend and at 5% from Monday onward
  • English Wikipedia's sampling rate never changes, so its data is weighted differently relative to the other wikis before and after the change
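The observations above can be made concrete with a small sketch. It is not part of any Test Kitchen or xLab tooling: the function name `sampling_exposure`, the year 2025 (chosen to match the authoring date, which makes June 20 a Friday), the inclusive end date, and the "percent-days" measure are all assumptions made for illustration. The measure simply adds up each wiki's sampling rate per day, which ignores real traffic volumes.

```python
from datetime import date, timedelta

# Assumed year: the record only says "Friday June 20th"; 2025 matches the
# authoring date and makes June 20 a Friday.
START = date(2025, 6, 20)
END = date(2025, 6, 27)      # treated as inclusive (an assumption)
CHANGE = date(2025, 6, 23)   # Monday: the new rates apply from this day on

# Sampling rates in percent, keyed by wiki name (values from the table above).
INITIAL = {"English Wikipedia": 0.1, "Estonian Wikipedia": 1.0}
CHANGED = {
    "English Wikipedia": 0.1,
    "Estonian Wikipedia": 5.0,
    "French Wikisource": 5.0,
}


def sampling_exposure(start, end, change, initial, changed):
    """Sum each wiki's daily sampling rate, split into weekday and weekend.

    The result is in hypothetical "percent-days": a rough proxy for how much
    data each wiki contributes, assuming constant daily traffic.
    """
    wikis = set(initial) | set(changed)
    exposure = {wiki: {"weekday": 0.0, "weekend": 0.0} for wiki in wikis}
    day = start
    while day <= end:
        rates = initial if day < change else changed
        # date.weekday() returns 5 for Saturday and 6 for Sunday.
        kind = "weekend" if day.weekday() >= 5 else "weekday"
        for wiki, rate in rates.items():
            exposure[wiki][kind] += rate
        day += timedelta(days=1)
    return exposure


exposure = sampling_exposure(START, END, CHANGE, INITIAL, CHANGED)
for wiki in sorted(exposure):
    print(wiki, exposure[wiki])
```

Under these assumptions, French Wikisource accumulates 25 weekday percent-days and none on the weekend, while Estonian Wikipedia accumulates 26 on weekdays against only 2 on the weekend.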

Some realities this kind of data can hide:

  • The feature performs very differently (poorly or well) on weekends on every wiki except English Wikipedia
  • The feature performs great on weekends, but something distracting happened on English Wikipedia that weekend, so the results of the experiment were very poor
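A minimal sketch of how a pooled result can hide such a reality follows. Every number in it is hypothetical and invented purely for illustration; the row layout, the function name `lift_by`, and the "conversion" metric are assumptions, not the schema or method used by the automated analytics. English Wikipedia is left out to keep the example small.

```python
from collections import defaultdict

# HYPOTHETICAL rows: (wiki, day_type, group, users, conversions).
# The treatment is 10 points worse on the weekend and 3 points better on
# weekdays. French Wikisource has no weekend rows at all.
ROWS = [
    ("Estonian Wikipedia", "weekend", "control", 100, 20),
    ("Estonian Wikipedia", "weekend", "treatment", 100, 10),
    ("Estonian Wikipedia", "weekday", "control", 2600, 520),
    ("Estonian Wikipedia", "weekday", "treatment", 2600, 598),
    ("French Wikisource", "weekday", "control", 2500, 500),
    ("French Wikisource", "weekday", "treatment", 2500, 575),
]


def lift_by(rows, key):
    """Return treatment-minus-control conversion rate for each stratum.

    `key` maps (wiki, day_type) to a stratum label. Each stratum is assumed
    to contain both groups; a missing group would raise a KeyError-free
    ZeroDivisionError, which this sketch does not handle.
    """
    totals = defaultdict(lambda: {"control": [0, 0], "treatment": [0, 0]})
    for wiki, day_type, group, users, conversions in rows:
        stratum = totals[key(wiki, day_type)]
        stratum[group][0] += users
        stratum[group][1] += conversions
    lifts = {}
    for label, groups in totals.items():
        c_users, c_conv = groups["control"]
        t_users, t_conv = groups["treatment"]
        lifts[label] = t_conv / t_users - c_conv / c_users
    return lifts


pooled = lift_by(ROWS, lambda wiki, day_type: "all")
by_day_type = lift_by(ROWS, lambda wiki, day_type: day_type)
print("pooled:", pooled)
print("by day type:", by_day_type)
```

With these made-up numbers the pooled lift is positive (about +2.75 percentage points) because the heavily sampled weekdays dominate, while the weekend stratum on its own shows a lift of -10 percentage points. Splitting by day type, as a manual analysis of the raw data could, is what exposes the difference.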

Our automated analytics is not detailed enough to account for these biases in the data, and basing decisions on the results reported in the dashboard may lead to incorrect conclusions. More detailed analysis of the raw data is possible, but it would have to be a manual investigation. So, at our current experimentation maturity, keeping the configuration constant throughout the experiment is the best way to get trustworthy results. Our next priority as a team is enabling more detailed analysis so we can better investigate biases like these, as well as others that are not related to our experiment configuration.

If you understand these caveats and still need to change the sampling configuration, xLab will allow the changes.

Status: Draft

Author: Dan Andreescu

Deciders: Adam Baso, Santiago Faci, Julie van der Hoop, Clare Ming, Mikhail Popov, Sarai Sanchez, Sam Smith

Consulted:

Informed:

Date authored: 2025-06-17

Date decided:

Technical story: https://phabricator.wikimedia.org/T396650

Keywords: experimentation culture, sampling