
# Performance/Runbook/WebPageReplay/Alert: Difference between revisions

imported>Phedenskog (Update names to include performance)
imported>Phedenskog (Added links to rum/webpagetest)


## Latest revision as of 07:59, 29 October 2021

This is the runbook for **WebPageReplay alerts**.

## Meta

- Issue tracker (Phabricator): WebPageReplay
- Documentation: WebPageReplay

## WebPageReplay alert fired

Our WebPageReplay tests measure the front end performance of Wikipedia (using a WebPageReplay proxy). If an alert fires, it can be caused by:

- A front end performance regression of Wikipedia
- A regression in the browser that is used for the test
- Instability on the server that runs the tests

### Front end performance regression

- Go to the WebPageReplay alert Grafana dashboard to see/verify the alert.
- Go to the individual page dashboard and zoom in on the regression. Try to narrow down the time of the regression (to within roughly ±2 hours). Check all tested URLs and see if they all show the regression.
- Verify the regression on WebPageTest and check if you can see anything in the RUM data (RUM data normally lags behind, since our tests switch browser versions quickly while it takes time for users to upgrade).
- If you can't find anything in the other tools, check if it's a browser regression or a test server regression.
- Check the Server Admin Log to see if there's been a change that correlates with the regression.

If you can verify that it is a regression, create a Phabricator task and include everything you know. Please take screenshots of the dashboards and include links. If you can identify the code change that caused the regression, please include the responsible team/person in the task.
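The correlation step against the Server Admin Log can be sketched in a few lines of Python. This is only an illustration: it assumes you have already copied the relevant log entries into `(timestamp, message)` pairs by hand, and the function name `entries_near`, the entry format and the sample entries are all hypothetical, not something the Server Admin Log provides.

```python
from datetime import datetime, timedelta

def entries_near(entries, regression_time, window=timedelta(hours=2)):
    """Return the (timestamp, message) entries within +/- window of the regression.

    `entries` is an assumed, hand-collected list of (datetime, str) pairs.
    The default window mirrors the rough +/- 2 hours used in the runbook.
    """
    return [
        (ts, msg)
        for ts, msg in entries
        if abs(ts - regression_time) <= window
    ]

if __name__ == "__main__":
    # Hypothetical entries, for illustration only.
    log = [
        (datetime(2021, 10, 29, 5, 0), "hypothetical deploy A"),
        (datetime(2021, 10, 29, 7, 30), "hypothetical config change B"),
        (datetime(2021, 10, 29, 12, 0), "hypothetical deploy C"),
    ]
    for ts, msg in entries_near(log, datetime(2021, 10, 29, 7, 0)):
        print(ts.isoformat(), msg)
```

Anything the filter returns is only a candidate; you still need to check whether the change could plausibly affect front end performance.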

### Browser performance regression

- Go to the dashboard for WebPageReplay tests.
- Make sure the *domain*, *page* and *browser* match the alert that fired (i.e. you are looking at the right data).
- Zoom in using the time dropdown. Use the last 24 hours or two days, and make sure the regression happened within that time window.
- Click on *Show each tests* and wait a couple of seconds until you see green vertical lines appear on the graphs.
- Hover the mouse over the green lines before the regression and after the regression. Hovering shows a screenshot of the test and the versions of sitespeed.io and the browser that were used when the test was executed. It will look something like this: 20.3.0 - 95.0.4638.54. The first part is the sitespeed.io version and the second part is the browser version.
- Verify whether it is the exact same browser version before the regression and after the regression.
- If the browser version differs, verify the regression on all tested URLs and check if you can see the same thing on the tests running without WebPageReplay.

If it looks like the browser caused the regression, we can roll back the sitespeed.io version running WebPageReplay (look at the changelog to see which sitespeed.io version includes which browser version) to verify the regression with certainty. If the regression is verified, you should create an upstream bug for the browser.
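If you note down the version labels from the hover tooltips, the before/after comparison can be sketched as below. This is a minimal sketch that assumes the label always has the form `<sitespeed.io version> - <browser version>`, as in the example `20.3.0 - 95.0.4638.54`; the names `Versions`, `parse_label` and `browser_changed` are hypothetical helpers, not part of sitespeed.io or Grafana.

```python
import re
from typing import NamedTuple

class Versions(NamedTuple):
    sitespeed: str
    browser: str

# Matches labels such as "20.3.0 - 95.0.4638.54" (assumed format).
_LABEL = re.compile(r"^\s*(\d+(?:\.\d+)*)\s+-\s+(\d+(?:\.\d+)*)\s*$")

def parse_label(label: str) -> Versions:
    """Split a tooltip label into sitespeed.io and browser versions."""
    match = _LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognised version label: {label!r}")
    return Versions(sitespeed=match.group(1), browser=match.group(2))

def browser_changed(before: str, after: str) -> bool:
    """True if the browser version differs between two tooltip labels."""
    return parse_label(before).browser != parse_label(after).browser

if __name__ == "__main__":
    print(parse_label("20.3.0 - 95.0.4638.54"))
    # → Versions(sitespeed='20.3.0', browser='95.0.4638.54')
```

The comparison looks only at the browser part. A sitespeed.io upgrade that leaves the browser version unchanged would return `False`, so it would need to be investigated separately.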

### Test server performance regression

If the regression is on emulated mobile, make sure *type* is set to **emulatedMobile** and *Test type* is set to **webpagereplay** in the dashboard. The default links are for desktop.

- Check the standard deviation of the CPU benchmark; it should be around 1 ms.
- Look at the min/median/max values of the CPU benchmark.
- If the standard deviation is high, contact the performance team, who need to deploy the tests on another server.
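If you have exported the CPU benchmark values for the time window, the checks above can be sketched as follows. This is an illustrative sketch, not the team's tooling: the function name `summarize_cpu_benchmark` and the `max_stdev_ms` threshold of 3 ms are assumptions. The runbook only says the standard deviation should be around 1 ms, so pick a threshold that fits what the dashboard normally shows.

```python
import statistics

def summarize_cpu_benchmark(samples_ms, max_stdev_ms=3.0):
    """Summarize CPU benchmark samples (in milliseconds).

    `max_stdev_ms` is a hypothetical cut-off for "high" variation;
    the expected standard deviation is around 1 ms.
    """
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples to compute a standard deviation")
    stdev = statistics.stdev(samples_ms)  # sample standard deviation
    return {
        "min": min(samples_ms),
        "median": statistics.median(samples_ms),
        "max": max(samples_ms),
        "stdev": stdev,
        "unstable": stdev > max_stdev_ms,
    }

if __name__ == "__main__":
    # Hypothetical benchmark values, for illustration only.
    print(summarize_cpu_benchmark([60, 61, 59, 60, 61]))
```

A result with `unstable` set to `True` is the case where the performance team should be contacted about moving the tests to another server.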