Performance/Regressions

We have different tools to find performance regressions, and automated alerts that fire when they suspect a regression. When an alert fires, we need to find the cause of the regression. There are two different types of performance alerts: synthetic testing and real user measurements. Synthetic testing can find smaller regressions by analyzing a video recording of the screen (but only tests a few use cases), while real user measurements find larger regressions, reported through browser performance APIs, that affect many users.
 
== You got a performance alert, what's the next step? ==
You want to understand what's causing the regression: Is it a code change, is it something in the environment, is it a new browser version or has something changed in the toolchain measuring performance?
 
The first thing to do is to find out whether the regression is across the board (all URLs, all browsers, all synthetic tools, both synthetic and RUM metrics). Once you know that, you are well on the way to finding the root cause of the problem.
 
== Synthetic testing ==
 
Synthetic testing alerts typically reference '''WebPageTest''' or '''WebPageReplay'''.  For example:
    Notification Type: PROBLEM
   
    Service: https://grafana.wikimedia.org/dashboard/db/webpagereplay-mobile-alerts grafana alert
    Host: einsteinium
    Address: 208.80.155.119
    State: CRITICAL
   
    Date/Time: Tue Sept 11 22:14:46 UTC 2018
   
    Notes URLs:
   
    Additional Info:
   
    CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagereplay-mobile-alerts is alerting: Rendering Mobile enwiki CPU alert.
or
    Notification Type: PROBLEM
   
    Service: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts grafana alert
    Host: einsteinium
    Address: 208.80.155.119
    State: CRITICAL
   
    Date/Time: Thu Sept 13 04:12:19 UTC 2018
   
    Notes URLs: https://phabricator.wikimedia.org/T203485
   
    Additional Info:
   
    CRITICAL: https://grafana.wikimedia.org/dashboard/db/webpagetest-alerts is alerting: Start Render Chrome Desktop [ALERT] alert.
 
=== Background ===
We run two different synthetic testing tools to find regressions: [[Performance/WebPageTest|WebPageTest]] includes network/server time, while [[Performance/WebPageReplay|Browsertime/WebPageReplay]] focuses exclusively on front-end performance. We run WebPageTest for English Wikipedia (desktop and mobile) and Browsertime/WebPageReplay for English, ''Swedish, French, Dutch, German, Spanish, Japanese, Chinese, Russian, beta, group 0 and group 1'' (desktop and mobile).
 
You can read more about the [[Performance/WebPageReplay/Alerts|WebPageReplay alerts]] to get an understanding of what we test.
 
=== Useful dashboards ===
If the alert comes from WebPageTest, you can start by checking the generic WebPageTest dashboard: https://grafana.wikimedia.org/dashboard/db/webpagetest and then drill down and check the metrics for the individual URL: https://grafana.wikimedia.org/dashboard/db/webpagetest-drilldown
 
If the alert is coming from WebPageReplay/Browsertime you should start with the generic dashboard: https://grafana.wikimedia.org/dashboard/db/webpagereplay and then check each URL https://grafana.wikimedia.org/dashboard/db/webpagereplay-drilldown
 
=== Where to start ===
A good starting point is to find out at what point in time the regression was introduced. If you can find that, then you can compare screenshots and [http://www.softwareishard.com/blog/har-12-spec/ HAR] files (which describe what assets the browser downloads and when) before and after the regression.
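
If you want a quick overview of what changed between two HAR files outside of the tools' compare pages, a small script can help. The following is a minimal sketch (not part of our tooling): it lists requests that only appear in one of the files, and ''before.har''/''after.har'' are placeholder names for HAR files you have downloaded yourself.

<syntaxhighlight lang="python">
# Minimal sketch: list request URLs that appear in one HAR but not the other.
# "before.har" and "after.har" are placeholder file names, not produced by
# our tooling; download and rename the HAR files yourself.
import json

def request_urls(path):
    """Return the set of request URLs in a HAR 1.2 file."""
    with open(path, encoding="utf-8") as f:
        har = json.load(f)
    return {entry["request"]["url"] for entry in har["log"]["entries"]}

before = request_urls("before.har")
after = request_urls("after.har")

print("Requests only made after the regression:")
for url in sorted(after - before):
    print("  ", url)

print("Requests no longer made:")
for url in sorted(before - after):
    print("  ", url)
</syntaxhighlight>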
 
==== WebPageTest ====
To find specific runs in WebPageTest, you need to use the [http://wpt.wmftest.org/testlog.php?days=7&filter=&all=on&nolimit=on search page]. It will show a lot of runs so make sure you pick the right ones!
 
A couple of things to know: Make sure you choose  '''Show tests from all users''' and '''Do not limit the number of results (warning, WILL be slow)'''. That way you are sure you will see all the tests. Also change the '''View''' to include enough days to go back to when the regression happened.
[[File:Webpagetest search form.png|thumb|220x220px|WebPageTest search form|alt=|none]]
 
You can also use the '''URLs containing''' field to narrow down the results.
 
It's important that you get the runs from before and after the regression within the same search result, because you use the small checkboxes to the left of the results to pick the runs to compare. It usually takes some work to find the right runs, so have patience. When you have picked two runs, click the (small) '''Compare''' button.
[[File:WebPageTest search result.png|thumb|Search result with check boxes and compare button.|alt=|none]]
 
When you click '''Compare''', you will see a comparison of the waterfall charts (built from the HAR files), screenshots and videos for the selected runs.
 
Some things to look for:
 
* Are there assets that are being downloaded after the regression, that were not being downloaded before it?
* Are there specific assets that are downloading slowly?
* Has anything visible changed on the page?  (For example, we frequently have alerts fire when fundraising campaigns start, and we sometimes see alerts when an edit is made to a page that changes it significantly.)
 
==== Browsertime/WebPageReplay ====
To find specific runs, you need to go to the storage where we keep all the data for each run. The easiest way to do that is to use the [https://grafana.wikimedia.org/dashboard/db/webpagereplay-drilldown?orgId=1 Grafana dashboard].
 
In the drop downs, make sure you pick the wiki, device, browser, latency and page you want to compare.
[[File:Webpagereplay choose page in Grafana.png|none|thumb|Choose page using the drop downs]]
When the page has refreshed, the links to the storage are updated. Look to the right of the dashboard and you will see a screenshot of the page and two links: '''Latest run''' and '''Older runs'''. If the latest run includes the regression (i.e. the regression is still ongoing), you can click that link and a compare page will open with all the metrics from the latest run.
[[File:Latest run and older runs for WebPageReplay.png|none|thumb|Use the links to get to the result.]]
The next step is to find a run without the regression. You probably saw that already when looking at the graphs, so go back and remember the date and time just before the regression and then use the '''Older runs''' link.
[[File:Webpagereplay folders.png|none|thumb|Each run has its own date and time folder.]]
There you will see date folders; scroll down to find your date and time and click on that folder. You will then see a list of the data collected for that run: screenshots, videos, the HAR file (and, if the run used Chrome, a list of trace logs that you can drag and drop into DevTools). Choose the HAR file (''browsertime.har.gz'') and download it to your desktop.
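
If you prefer to fetch and unpack the HAR from a script instead of the browser, something like the sketch below works. The URL is a placeholder; copy the real link to ''browsertime.har.gz'' from the date/time folder in the storage listing.

<syntaxhighlight lang="python">
# Sketch: download and decompress a browsertime.har.gz from a run folder.
# RUN_URL is a placeholder -- copy the real link from the storage listing.
import gzip
import urllib.request

RUN_URL = "https://example.org/path/to/run/browsertime.har.gz"  # placeholder

with urllib.request.urlopen(RUN_URL) as response:
    compressed = response.read()

with open("browsertime.har", "wb") as f:
    f.write(gzip.decompress(compressed))

print("Saved browsertime.har")
</syntaxhighlight>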
 
The next step is to go back to the tab where you opened the compare page. Choose one of the upload buttons and upload your newly downloaded HAR file.
[[File:HAR upload button.png|none|thumb|Upload your HAR by choosing one of the upload buttons.]]
Now you will see both HARs (check the waterfalls), screenshots, videos and summaries of the two runs, which hopefully helps you spot differences.
 
=== Tips and tricks ===
Check the screenshots (the easiest way is to go to https://grafana.wikimedia.org/dashboard/db/webpagereplay-drilldown). Look out for campaigns and try to correlate them with when they were activated. You can also find screenshots (and videos) at http://webpagereplay-wikimedia.s3-website-us-east-1.amazonaws.com/ for WebPageReplay (you will find direct links on the [https://grafana.wikimedia.org/dashboard/db/webpagereplay-drilldown dashboard]) or at http://wpt.wmftest.org/testlog.php?days=1&filter=&all=on&nolimit=on for WebPageTest.
 
Check if there has been a release of the tool (for WebPageReplay make sure you click '''Show WebPageReplay changes''' and for WebPageTest '''Show WebPageTest changes'''). If the performance team updates the tool (a new version of the tool, a new version of the browser), there will be an annotation for that. It has happened that new browser versions have introduced regressions. '''WARNING''': We still auto-update WebPageTest, so it can happen that we miss an annotation for a browser upgrade or a change in the tool.
 
Check whether a release correlates with the change by choosing '''Show sync-wikiversions''' and checking the [[Server Admin Log|server admin log]].
 
Do you see any changes in the [https://grafana.wikimedia.org/dashboard/db/navigation-timing Navigation Timing metrics]? It's always good to try to verify the change in both of our ways of collecting metrics.
[[File:WebPageTest trace log.png|thumb|Where to find the Chrome trace log in WebPageTest]]
If the tests are run with Chrome we collect the internal trace log (both for WebPageTest and Browsertime/WebPageReplay), which you can use to dig deeper into what happens. For WebPageTest, you can download the log using the '''Trace''' link. For Browsertime/WebPageReplay, the log for each run is in the result directory. Download the files, unpack them and drag and drop them into ''Developer Tools/Performance'' in Chrome.
 
== Real user measurement ==
The real user measurements are metrics that we collect from real users, using browser APIs. Historically these metrics have been more technical than those collected by synthetic testing, as we can't get visual measures from the user's browser.
 
Alerts that derive from Real User Measurement data will typically reference Navigation Timing in the alert.  For example:
    Notification Type: PROBLEM
   
    Service: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts grafana alert
    Host: einsteinium
    Address: 208.80.155.119
    State: CRITICAL
   
    Date/Time: Fri Aug 31 05:02:38 UTC 2018
   
    Notes URLs:
   
    Additional Info:
   
    CRITICAL: https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts is alerting: Load event overall median.
 
=== Background ===
The real user measurement collects data from all browsers that support the [https://www.w3.org/TR/navigation-timing/ Navigation Timing API]. It also collects additional metrics, like first paint (when something is first displayed on the screen) or the effective connection type, when the browser supports those additional APIs. We sample the data, using 1 out of 1000 requests by default. This can be overridden for specific geographies, pages, etc., where the sampling rate might be different.
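
As a rough illustration of what the 1-in-1000 sampling means (this is not the actual client-side instrumentation, which makes the sampling decision in the browser): only about one page view per thousand reports its metrics, so recorded counts have to be scaled up by roughly the sampling factor to estimate real traffic.

<syntaxhighlight lang="python">
# Illustration only: 1-in-1000 sampling of page views. The real decision is
# made by the client-side NavigationTiming instrumentation, not this script.
import random

SAMPLE_RATE = 1 / 1000  # default rate; may differ per geography or page

def in_sample(rate=SAMPLE_RATE):
    """Return True if this simulated page view should report its metrics."""
    return random.random() < rate

reports = sum(in_sample() for _ in range(1_000_000))
print(f"{reports} of 1,000,000 simulated page views reported metrics")
</syntaxhighlight>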
 
=== Useful dashboards ===
The main Navigation Timing dashboards are a good way to start, with the [https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts alert dashboard] and the [https://grafana.wikimedia.org/dashboard/db/navigation-timing generic one].
 
=== Where to start ===
Start with the [https://grafana.wikimedia.org/dashboard/db/navigation-timing-alerts alert dashboard] to verify the alert. Then head over to https://grafana.wikimedia.org/dashboard/db/navigation-timing and check the metric that caused the alert (first paint, responseStart, loadEventEnd) and try to identify how big the issue is (is it causing other metrics to increase? check different percentiles and different metrics to try to understand what has changed).
 
At this level the metrics are collected for both '''mobile''' and '''desktop''' and grouped together. Go to https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-platform to see the metrics split per platform. There can be a big difference between the two.
 
Then check '''Show sync-wikiversions''' along with the [[Server Admin Log]] to see if any change has been made at the time of the regression.
[[File:Sync wiki versions showing when a change is pushed.png|none|thumb|Sync wiki versions showing when a change is pushed]]
 
=== Tips and tricks ===
If you cannot find what caused the regression you can try the [https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser Navigation Timing by browser] dashboard. [https://grafana.wikimedia.org/dashboard/db/navigation-timing-by-browser?refresh=5m&panelId=6&fullscreen&orgId=1 Check the report rate]: has it changed? It could be that we did a release and accidentally changed how we collect the metrics, or that a new browser version rolled out that affects the metrics. You can see how many metrics we collect for specific browser versions.
 
Do you see any change in the synthetic metrics? Use both kinds of tools to try to nail down the regression. The synthetic tools can more easily show you what has changed (by checking HAR files from before and after the change).
 
It's possible that further drilling down is required and you may need to slice the data by features other than platform, browser or geography. For this, it's best to use [[Analytics/Systems/Cluster/Hive|Hive]] and query the raw RUM data recorded under the [[metawiki:Schema:NavigationTiming|NavigationTiming]] [[Analytics/Systems/EventLogging|EventLogging]] schema. Remember to narrow down your Hive queries to the timespan around the regression, as the NavigationTiming table is huge (we record around 14 records per second on average).
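
As an example of what such a query could look like, here is a hedged sketch. The table and column names (''event.navigationtiming'', the ''event.*'' struct fields, the ''year''/''month''/''day'' partition columns) and the dates are assumptions for illustration; check the actual schema on the analytics cluster, and always restrict the partitions to the days around the regression.

<syntaxhighlight lang="python">
# Hedged sketch of a Hive query around a regression window, run via the
# hive CLI on an analytics client host. Table name, struct fields and
# partition columns are assumptions -- verify against the real schema.
import subprocess

QUERY = """
SELECT year, month, day,
       percentile_approx(event.loadEventEnd, 0.75) AS p75_load_event_end
FROM event.navigationtiming
WHERE year = 2018 AND month = 8 AND day BETWEEN 29 AND 31  -- keep the timespan narrow
GROUP BY year, month, day;
"""

subprocess.run(["hive", "-e", QUERY], check=True)
</syntaxhighlight>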
