This page is a work in progress!
If you did not measure it (well), it did not happen
Also, many of these ideas apply to any variable that you are measuring.
Averages versus Medians
The average is the most commonly used statistic, but also the most commonly misused. A mean (or average) is calculated by adding all values of the set and dividing by the number of values. Calculated in this way, a mean is heavily influenced by outliers.
Lesson: To understand the distribution of your data, you need to plot it.
The statistic you probably want when talking about performance is the "median". A median is simply the middle value of a dataset when sorted from lowest to highest. The median and percentile 50 are the same thing.
<add example of data series with salaries, see mean and median and which one represents distribution best>
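A minimal sketch of the mean-versus-median point, using a made-up salary list (the figures are hypothetical and chosen only so that one value is an obvious outlier):

```python
from statistics import mean, median

# Hypothetical yearly salaries; the last value is a deliberate outlier.
salaries = [30_000, 32_000, 35_000, 38_000, 40_000, 42_000, 45_000, 1_000_000]

salary_mean = mean(salaries)      # pulled far upwards by the single outlier
salary_median = median(salaries)  # middle of the sorted values

print("mean:  ", salary_mean)
print("median:", salary_median)
```

In this sketch seven of the eight salaries are at or below 45,000, yet the mean lands well above 150,000. The median stays close to what a typical person in the list earns.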
Latency is bimodal
We are all familiar with the normal distribution, but when looking at performance we look at latency values frequently and latency data is not normal. It's bimodal. Bimodal means that the distribution has two peaks. Mean, median and standard deviation values are of little use to describe such data.
Plotted, a bimodal distribution looks like this:
Separate modes in a data distribution can be caused by many factors. In the case of web network latency, they can be explained by cache hits and misses. That is why, rather than just describing our latency data with a median (percentile 50), we need to look at the edges of the distribution: percentiles 90 and 99.
A percentile is the value in a dataset below which a specific percentage of values fall. Example: if we calculate percentiles on latency measures, a percentile 90 of 2.0 secs means that 90% of our users are seeing values below 2 secs (good!).
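The following sketch computes percentiles 50, 90 and 99 on simulated bimodal latency data. Everything in it is an assumption made for illustration: the `percentile` helper uses the nearest-rank convention (one of several in common use), and the 80/20 split between cache hits and misses and their timings are invented.

```python
import math
import random

def percentile(values, p):
    """Nearest-rank percentile: smallest value with at least p% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

random.seed(42)
# Hypothetical latencies in seconds: 80% cache hits near 0.1 s, 20% misses near 1.5 s.
hits = [random.gauss(0.1, 0.02) for _ in range(8000)]
misses = [random.gauss(1.5, 0.3) for _ in range(2000)]
latencies = hits + misses

latency_mean = sum(latencies) / len(latencies)
print(f"mean: {latency_mean:.2f} s")
for p in (50, 90, 99):
    print(f"p{p}: {percentile(latencies, p):.2f} s")
```

With data shaped like this, the mean falls between the two peaks, where almost no request actually lands. Percentile 50 describes the cache-hit mode, while percentiles 90 and 99 describe the cache-miss mode.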
t-test is not meaningful
When comparing before and after results - like, say, latency data before you changed your site to HTTPS and after - be wary of using comparison methods like the t-test. These do not work well in "situations in which the control and treatment groups do not differ in mean, but only in some other way". 
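As a rough illustration of that quote, the sketch below builds two deliberately artificial latency samples with the same mean but very different tails. The data and the `welch_t` and `percentile` helpers are hypothetical and written only for this example. Welch's t statistic comes out at essentially zero, while percentile 99 differs by several seconds.

```python
import math
from statistics import mean, variance

def percentile(values, p):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def welch_t(a, b):
    """Welch's t statistic: difference in means divided by its standard error."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error

# Constructed data, in seconds. Both samples have a mean of 1.0.
before = [0.9 + 0.2 * i / 999 for i in range(1000)]  # tight cluster around 1.0
after = [0.5] * 900 + [5.5] * 100                    # most requests faster, 10% far slower

t = welch_t(before, after)
print(f"t statistic: {abs(t):.3f}")
print(f"p99 before: {percentile(before, 99):.2f} s")
print(f"p99 after:  {percentile(after, 99):.2f} s")
```

A test that only compares means would report no difference here, even though one request in ten became much slower.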
Be wary of normalizing the distribution
A distribution of a variable x that is not normal can be "normalized" if we take log(x). We do not recommend doing this for latency data, as you would be obscuring important characteristics of the data, like the effect of caching on latency, which produces the modality.
Benchmark quality: Do you have enough data?
It is crucial to have enough data to assess whether the change we made had some effect on performance. If the amount of data we have is too small, we might just be seeing the effect of random variations. Chance has an enormous influence and you might just be wasting your time trying to give meaning to random variations. “People expect that a sequence of events generated by a random process will represent the essential characteristics of that process even when the sequence is short.”
How do we get enough data so our sample is statistically significant?
Statistical significance is somewhat of a dry topic, but there are rules of thumb that we can use. The advantage of performance testing is that in most instances we can sample as much data as we need and it is easy to sample repeatedly.
"Typically, it is fairly easy to add iterations to performance tests to increase the total number of measurements collected; the best way to ensure statistical significance is simply to collect additional data if there is any doubt about whether or not the collected data represents reality. Whenever possible, ensure that you obtain a sample size of at least 100 measurements from at least two independent tests."
Now, be aware (see below) that to calculate a percentile 90 or 99 you need more measures. Keep in mind that a good rule of thumb for calculating percentiles is to have at least 100 samples for a percentile 50, 1000 for a percentile 90, 10000 for a percentile 99, and so on.
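To see why high percentiles need more measures, the sketch below estimates percentile 99 repeatedly from samples of different sizes and reports how much the estimates disagree. The exponential distribution, the seed, and the 30 repetitions are arbitrary assumptions for illustration, not a model of real latency.

```python
import math
import random

def percentile(values, p):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

random.seed(1)
spread = {}
for n in (100, 1000, 10000):
    # Estimate p99 from 30 independent samples of size n.
    estimates = [
        percentile([random.expovariate(1.0) for _ in range(n)], 99)
        for _ in range(30)
    ]
    # Range of the estimates: a crude measure of how unstable p99 is at this n.
    spread[n] = max(estimates) - min(estimates)
    print(f"n={n:>5}: p99 estimates vary by {spread[n]:.2f}")
```

With only 100 measures, percentile 99 is decided by one or two of the largest values, so it jumps around from one sample to the next. Larger samples put more observations in the tail and the estimate settles down.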
"Although there is no strict rule about how to decide which results are statistically similar without complex equations that call for huge volumes of data that commercially driven software projects rarely have the time or resources to collect, the following is a reasonable approach to apply if there is doubt about the significance or reliability of data after evaluating two test executions where the data was expected to be similar. Compare results from at least five test executions and apply the rules of thumb below to determine whether or not test results are similar enough to be considered reliable:
- If more than 20 percent (or one out of five) of the test-execution results appear not to be similar to the others, something is generally wrong with the test environment, the application, or the test itself.
- If a 90th percentile value for any test execution is greater than the maximum or less than the minimum value for any of the other test executions, that data set is probably not statistically similar.
- If measurements from a test are noticeably higher or lower, when charted side-by-side, than the results of the other test executions, it is probably not statistically similar."
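The first two rules of thumb can be sketched in code. This is one possible reading, not the quoted authors' procedure: the function names, the pairwise comparison, the "disagrees with more than half of the other runs" threshold, and the five runs of data are all assumptions made for illustration. The third rule is a visual check and is not covered.

```python
import math

def percentile(values, p):
    """Nearest-rank percentile (one of several common conventions)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def dissimilar(a, b):
    """Rule 2: the p90 of one run falls outside the [min, max] range of the other."""
    return (not min(b) <= percentile(a, 90) <= max(b)
            or not min(a) <= percentile(b, 90) <= max(a))

def outlier_runs(runs):
    """Flag runs that disagree with more than half of the other runs (assumed threshold)."""
    flagged = []
    for name, data in runs.items():
        others = [d for other_name, d in runs.items() if other_name != name]
        disagreements = sum(dissimilar(data, other) for other in others)
        if disagreements > len(others) / 2:
            flagged.append(name)
    return flagged

# Hypothetical latencies in seconds for five test executions; run5 is shifted by 2 s.
offsets = {"run1": 0.00, "run2": 0.02, "run3": -0.01, "run4": 0.01, "run5": 2.00}
runs = {name: [1.0 + 0.01 * i + off for i in range(100)] for name, off in offsets.items()}

flagged = outlier_runs(runs)
print("runs that do not look similar:", flagged)
# Rule 1: more than one run in five flagged suggests a problem with the setup itself.
if len(flagged) / len(runs) > 0.20:
    print("more than 20% of runs differ: check the environment, application, or test")
```

In this made-up data only one of the five runs is flagged, which stays within the one-in-five tolerance of the first rule.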