Performance/Metrics: Difference between revisions

imported>Phedenskog
(Added WebPageTest metrics)
imported>Krinkle
 
(4 intermediate revisions by 3 users not shown)
Line 1: Line 1:
This page documents '''metrics from Performance Team services'''. Currently in [[Graphite]].  
This page documents '''metrics provided by Performance Team services'''.  


{{TOC|limit=2}}
{{TOC|limit=2}}


== Deployment and monitoring ==
== {{Anchor|navtiming-2 metrics}}navtiming ==
See [[Performance/Runbook/Webperf services]] for the internal details of these services, such as where the metrics originate, and how they are processed/aggregated.
These are real-user metrics, collected on a sample of production page views, from the [https://www.w3.org/TR/navigation-timing-2/ W3C Navigation Timing] and [https://w3c.github.io/paint-timing/ W3C Paint Timing] interface in web browsers.


== navtiming-2 metrics ==
Instrumented by [[mw:Extension:NavigationTiming|Extension:NavigationTiming]] ([https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/ddaddec16e93d1a6a15235a4090ac076108e1df4/modules/ext.navigationTiming.js client source code]), processed by our [[Performance/Runbook/Webperf-processor services|webperf-navtiming]] service, and published to [[Graphite]] under the <code>frontend.navtiming2</code> prefix.
The navtiming-2 metrics are available in Graphite under the <code>frontend.navtiming2</code> prefix.


=== Difference with navtiming-1 ===
We publish two kinds of metrics from here: {{Anchor|Offsets|Deltas}}
Notable differences:


* Offsets are computed relative to fetchStart instead of navigationStart.
* '''Milestones''' during a page load. This is an offset from the start of the page load, thus the total duration to that instant in time.
* We no longer filter out zero values.
* '''Durations''' for specific portions of a page load. These measure from the start to end of that particular operation.
* The sanity filter no longer has an upper bound.
* When the sanity filter encounters negative numbers, it rejects the entire event instead of just the individual data point.


See [[phab:T104902]] for more information about why the metrics were redefined.
=== Milestones ===


=== Offsets ===
*<code>responseStart</code>: From navigationStart to here (Navigation Timing).
*<code>domInteractive</code>: From navigationStart to here (Navigation Timing).
*<code>domComplete</code>: From navigationStart to here (Navigation Timing).
*<code>loadEventStart</code>: From navigationStart to here (Navigation Timing).
*<code>loadEventEnd</code>: From navigationStart to here (Navigation Timing). Also known as "page load time ('''PLT''')" or "On load", which typically corresponds with the browser's page loading indicator.
*<code>firstPaint</code>: From navigationStart to [https://w3c.github.io/paint-timing/#first-paint paint-timing].
*<code>firstContentfulPaint</code>: From navigationStart to [https://w3c.github.io/paint-timing/#first-contentful-paint paint-contentful-timing].
*<code>mediaWikiLoadEnd</code>: From navigationStart to the last of the initially loaded JavaScript on a page having finished its initial script execution. This is analogous to when <code>mw.loader.using(RLPAGEMODULES).then</code> would resolve.


* <code>responseStart</code>: From PerformanceTiming, relative to fetchStart.
=== Durations ===
* <code>firstPaint</code>: (non-standard)
* <code>domInteractive</code>: From PerformanceTiming, relative to fetchStart.
* <code>domComplete</code>: From PerformanceTiming, relative to fetchStart.
* <code>loadEventStart</code>: From PerformanceTiming, relative to fetchStart.
* <code>loadEventEnd</code>: From PerformanceTiming, relative to fetchStart. Also known as "Page load end" or "Total page load time". This  typically corresponds with the browser's native page loading indicator.


=== Deltas ===
*<code>dns</code>: Computed as <code>domainLookupEnd - domainLookupStart</code>, our intermediary layer labels this "dnsLookup".
*<code>unload</code>: Computed as <code>unloadEventEnd - unloadEventStart</code>.
*<code>redirect</code>: Computed as <code>redirectEnd - redirectStart</code>, our intermediary layer labels this "redirecting".
*<code>tcp</code>: Computed as <code>connectEnd - connectStart</code>. (As per the spec, browsers include any TLS handshake for HTTPS).
*<code>ssl</code>: Computed as <code>connectEnd - secureConnectionStart</code>. (As per the spec, browsers report this as subset of <code>tcp</code>).
*<code>request</code>: Computed as <code>responseStart - requestStart</code>.
*<code>response</code>: Computed as <code>responseEnd - responseStart</code>.
*<code>processing</code>: Computed as <code>domComplete - responseEnd</code>.
*<code>onLoad</code>: Computed as <code>loadEventEnd - loadEventStart</code>.


* <code>dns</code>: Computed client-side from PerformanceTiming <code>domainLookupEnd - domainLookupStart</code>. (Transmitted as "dnsLookup")
=== See also ===
* <code>unload</code>: Computed client-side from PerformanceTiming <code>unloadEventEnd - unloadEventStart</code>.
See [[phab:T104902]] for how some of these changes over time. In particular:
* <code>redirect</code>: Computed client-side from PerformanceTiming <code>redirectEnd - redirectStart</code>. (Transmitted as "redirecting").
* We don't filter out zero values.
* <code>mediaWikiLoad</code>: Computed client-side based on custom mwLoadEnd and mwLoadStart measures. (Transmitted as "mediaWikiLoadComplete").
* When negative numbers are encountered due to browser bugs, we rejects the entire beacon, not just that one data point. We measure how often this happens in [https://grafana.wikimedia.org/d/000000018/eventlogging-schema?orgId=1&var-schema=NavigationTiming&viewPanel=10 Grafana: EventLogging-schema / NavigationTiming] via the <code>nonCompliant</code> counter. The details of this are logged by [[Performance/Runbook/Webperf-processor services|webperf-navtiming]] to journalctl.
* <code>tcp</code>: Computed server-side as <code>connectEnd - connectStart</code>. This includes SSL negotiation.
* <code>request</code>: Computed server-side as <code>responseStart - requestStart</code>.
* <code>response</code>: Computed server-side as <code>responseEnd - responseStart</code>.
* <code>processing</code>: Computed server-side as <code>domComplete - responseEnd</code>.
* <code>onLoad</code>: Computed server-side as <code>loadEventEnd - loadEventStart</code>.
* <code>ssl</code>: Computed server-side as <code>connectEnd - secureConnectionStart</code>. This is a subset of <code>tcp</code>.


== SaveTiming metrics ==
* Milestone diagram: https://www.w3.org/TR/resource-timing-2/#attribute-descriptions
SaveTiming get reported to <code>mw.performance.save</code> in statsd. To see if it's running properly, the <code>mw.performance.save.sample_rate</code> key should have hits.
 
* [https://grafana.wikimedia.org/dashboard/db/save-timing View Grafana dashboard]  
== {{Anchor|SaveTiming metrics}}Save Timing ==
* [https://performance.wikimedia.org/ View Coal dashboard]
We define two metrics to represent the duratation of time to save of an edit. To save an edit, in MediaWiki, means to create or change a wiki page.
 
=== Backend Save Timing ===
Backend Save Timing measures time spent in MediaWiki PHP, from the process start (<code>REQUEST_TIME_FLOAT</code>) until the response is flushed to the web server for sending to the client (<code>PRESEND</code>). The instrumentation resides in the WikimediaEvents extension ([https://github.com/wikimedia/mediawiki-extensions-WikimediaEvents/blob/51f143cfbd335cafe6e83dce68600c39a17583ee/includes/WikimediaEventsHooks.php#L169-L180 source]), and is published to Graphite under <code>MediaWiki.timing.editResponseTime</code>.
 
The metric is plotted in [https://grafana.wikimedia.org/d/000000429/backend-save-timing-breakdown?orgId=1 Grafana: Backend Save Timing Breakdown], and includes slices by account type (bot vs human), by entry point (index.php wikitext editor, vs api.php for VisualEditor and bots), and by page type or namespace (Wikipedia content, or Wikidata entity, or discussion pages).
 
=== Frontend Save Timing ===
Frontend Save Timing is measured as time from pressing "Publish changes" from a user interface in a web browser (e.g. submitting the edit page form) until that browser recieves the first byte of the server response that will render the confirmation page (e.g. the article with their edit applied and a "[[mw:Post-edit feedback|Post-edit]]" message).
 
This is implemented as <code>navigationStart</code> (the click to submit the form over HTTP POST) to <code>responseStart</code> (the first byte after the server has finished processing the edit, redirected, and responded to the subsequent GET).
 
Instrumented by [[mw:Extension:NavigationTiming|Extension:NavigationTiming]] ([https://github.com/wikimedia/mediawiki-extensions-NavigationTiming/blob/ddaddec16e93d1a6a15235a4090ac076108e1df4/modules/ext.navigationTiming.js client source code]), processed [[Performance/Runbook/Webperf-processor services|webperf-navtiming]], and published to Statsd/Graphite under the <code>mw.performance.save</code>.
 
The metric is plotted toward the bottom of [https://grafana.wikimedia.org/d/000000085/save-timing?orgId=1 Grafana: Save Timing], and includes slices by wiki (group1 is Wikidata/Commons, group2 is Wikipedia, per [[Deployments/Train|Train groups]]).
 
=== See also ===
When investigating Save Timing metrics, it may be useful to correlate with:
 
* [https://performance.wikimedia.org/php-profiling/ performance.wikimedia.org: Flame Graphs], which shows where in the MW codebase time is spent code during particular operations. Use the "api" graph for edits via the API, and the "fn-EditAction" graph for edits via index.php.
 
* [https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?orgId=1 Grafana: Application Servers RED] , which measures the overall load and throughput of our web servers by cluster (e.g. appservers for index.php, api_appservers for api.php) and by method (e.g. POST is mostly edits).
* [https://grafana.wikimedia.org/d/000000208/edit-count? Grafana: Edit Count], which measures the overall rate of edits being saved. This count is derived directly from the Backend Save Timing metric, and thus corresponds fully. Any edit save that we measure for backend save timing is counted here, and vice-versa.


== Synthetic metrics ==
== Synthetic metrics ==
Line 131: Line 153:


=== WebPageReplay ===
=== WebPageReplay ===
The WebPageReplay metrics are available in Graphite under the <code>browsertime</code> prefix (it's Browsertime that collect the metrics).
The WebPageReplay metrics are available in <code>graphite-synthetic-testing</code> under the <code>sitespeed_io</code> prefix (it's Browsertime/sitespeed.io that collect the metrics).


At the moment we collect Visual Metrics and CPU metrics (Chrome only).
At the moment we all the default metrics that is collected when you sitespeed.io. Here's a list of some of the most important ones:


==== Visual Metrics ====
==== Visual Metrics ====
Line 140: Line 162:
*<code>Heading</code>: The time when the first h1 heading is painted at its final position within the viewport.
*<code>Heading</code>: The time when the first h1 heading is painted at its final position within the viewport.
*<code>LargestImage</code>: The time when the largest image is painted at its final position within the viewport.
*<code>LargestImage</code>: The time when the largest image is painted at its final position within the viewport.
*<code>Logo</code>: The time when the logo is painted within the viewport.
*<code>CentralNotice</code>: The time when the central notice banned  is painted at its final position within the viewport.
*<code>VisualComplete85</code>:The time when 85% of the content within the viewport is painted.
*<code>VisualComplete85</code>:The time when 85% of the content within the viewport is painted.
*<code>VisualComplete95</code>:The time when 95% of the content within the viewport is painted.
*<code>VisualComplete95</code>:The time when 95% of the content within the viewport is painted.
Line 151: Line 171:
==== CPU metrics ====
==== CPU metrics ====


*<code>Scripting</code>: Time spent running JavaScript.
*<code>LongTasks.tasks</code> : the number CPU long tasks (> 50 ms)
*<code>Painting</code>: Time spent in painting the screen.
*<code>LongTasks.beforeFirstPaint</code> : the number CPU long tasks happening before anything is painted on the screen (> 50 ms)
*<code>Rendering</code>: Time spent rendering the screen.
*<code>LongTasks.totalDuration</code> : the total time (in ms) spent in CPU long tasks
*<code>Loading</code>: Time spent in loading assets.
*<code>LongTasks.beforeFirstPaint.totalDuration</code> : the total time (in ms) spent in CPU long tasks before anything was painted on the screen.


== See also ==
== See also ==
Line 162: Line 182:
[[Category:Bot and monitoring]]
[[Category:Bot and monitoring]]
[[Category:Performance Team]]
[[Category:Performance Team]]
[[Category:Metrics]]

Latest revision as of 16:08, 12 April 2022

This page documents metrics provided by Performance Team services.

navtiming

These are real-user metrics, collected on a sample of production page views, from the W3C Navigation Timing and W3C Paint Timing interfaces in web browsers.

Instrumented by Extension:NavigationTiming (client source code), processed by our webperf-navtiming service, and published to Graphite under the frontend.navtiming2 prefix.

We publish two kinds of metrics from here:

  • Milestones during a page load. Each is an offset from the start of the page load, and thus the total duration up to that instant in time.
  • Durations for specific portions of a page load. These measure from the start to end of that particular operation.

Milestones

  • responseStart: From navigationStart to here (Navigation Timing).
  • domInteractive: From navigationStart to here (Navigation Timing).
  • domComplete: From navigationStart to here (Navigation Timing).
  • loadEventStart: From navigationStart to here (Navigation Timing).
  • loadEventEnd: From navigationStart to here (Navigation Timing). Also known as "page load time (PLT)" or "On load", which typically corresponds with the browser's page loading indicator.
  • firstPaint: From navigationStart to first paint (Paint Timing).
  • firstContentfulPaint: From navigationStart to first contentful paint (Paint Timing).
  • mediaWikiLoadEnd: From navigationStart to the last of the initially loaded JavaScript on a page having finished its initial script execution. This is analogous to when mw.loader.using(RLPAGEMODULES).then would resolve.
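
As a rough illustration of how a milestone relates to the raw browser timestamps, the sketch below subtracts navigationStart from each absolute value. The helper name `milestones` and the sample numbers are invented for this example; the actual computation lives in Extension:NavigationTiming and the webperf-navtiming service and may differ in detail.

```python
# Hypothetical helper for illustration only, not the NavigationTiming code.
MILESTONE_FIELDS = (
    "responseStart",
    "domInteractive",
    "domComplete",
    "loadEventStart",
    "loadEventEnd",
)


def milestones(timing):
    """Return each milestone as milliseconds elapsed since navigationStart."""
    start = timing["navigationStart"]
    return {
        name: timing[name] - start
        for name in MILESTONE_FIELDS
        if name in timing
    }


# Made-up absolute timestamps in milliseconds.
sample = {
    "navigationStart": 1000,
    "responseStart": 1200,
    "domInteractive": 1800,
    "domComplete": 2400,
    "loadEventStart": 2400,
    "loadEventEnd": 2500,
}

if __name__ == "__main__":
    print(milestones(sample))
```

Under these sample values, loadEventEnd comes out as 1500 ms, which is what the list above calls page load time.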

Durations

  • dns: Computed as domainLookupEnd - domainLookupStart. Our intermediary layer labels this "dnsLookup".
  • unload: Computed as unloadEventEnd - unloadEventStart.
  • redirect: Computed as redirectEnd - redirectStart. Our intermediary layer labels this "redirecting".
  • tcp: Computed as connectEnd - connectStart. (As per the spec, browsers include any TLS handshake for HTTPS).
  • ssl: Computed as connectEnd - secureConnectionStart. (As per the spec, browsers report this as a subset of tcp).
  • request: Computed as responseStart - requestStart.
  • response: Computed as responseEnd - responseStart.
  • processing: Computed as domComplete - responseEnd.
  • onLoad: Computed as loadEventEnd - loadEventStart.
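
The formulas above can be written down as a small lookup table. This is a minimal sketch that assumes the timing values arrive as a plain dictionary of millisecond timestamps; the names `DURATIONS` and `durations` are hypothetical, and the sample values are made up.

```python
# Each duration is (end field) - (start field), as listed above.
DURATIONS = {
    "dns": ("domainLookupStart", "domainLookupEnd"),
    "unload": ("unloadEventStart", "unloadEventEnd"),
    "redirect": ("redirectStart", "redirectEnd"),
    "tcp": ("connectStart", "connectEnd"),
    "ssl": ("secureConnectionStart", "connectEnd"),
    "request": ("requestStart", "responseStart"),
    "response": ("responseStart", "responseEnd"),
    "processing": ("responseEnd", "domComplete"),
    "onLoad": ("loadEventStart", "loadEventEnd"),
}


def durations(timing):
    """Compute every duration whose start and end fields are both present."""
    result = {}
    for name, (start_field, end_field) in DURATIONS.items():
        if start_field in timing and end_field in timing:
            result[name] = timing[end_field] - timing[start_field]
    return result


# Made-up sample covering only the connection phase.
sample = {
    "connectStart": 110,
    "secureConnectionStart": 130,
    "connectEnd": 170,
}

if __name__ == "__main__":
    print(durations(sample))
```

With this sample, tcp is 60 ms and ssl is 40 ms, which shows ssl as a subset of tcp.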

See also

See phab:T104902 for how some of these metrics have changed over time. In particular:

  • We don't filter out zero values.
  • When negative numbers are encountered due to browser bugs, we reject the entire beacon, not just that one data point. We measure how often this happens in Grafana: EventLogging-schema / NavigationTiming via the nonCompliant counter. The details of this are logged by webperf-navtiming to journalctl.
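
The two rules above can be sketched as a simple filter. This is not the webperf-navtiming code; the function names are hypothetical, and the local `non_compliant` count only stands in for the nonCompliant counter mentioned above.

```python
def is_compliant(beacon):
    """A beacon is compliant when none of its values is negative; zero is allowed."""
    return all(value >= 0 for value in beacon.values())


def filter_beacons(beacons):
    """Split beacons into accepted ones and a count of rejected ones.

    A single negative value rejects the whole beacon, not just that field.
    """
    accepted = []
    non_compliant = 0
    for beacon in beacons:
        if is_compliant(beacon):
            accepted.append(beacon)
        else:
            non_compliant += 1
    return accepted, non_compliant


# Made-up beacons: the second one has a negative value and is dropped entirely.
beacons = [
    {"dns": 0, "tcp": 40, "request": 120},
    {"dns": 5, "tcp": -3, "request": 110},
]

if __name__ == "__main__":
    print(filter_beacons(beacons))
```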

Save Timing

We define two metrics to represent the time it takes to save an edit. To save an edit, in MediaWiki, means to create or change a wiki page.

Backend Save Timing

Backend Save Timing measures time spent in MediaWiki PHP, from the process start (REQUEST_TIME_FLOAT) until the response is flushed to the web server for sending to the client (PRESEND). The instrumentation resides in the WikimediaEvents extension (source), and is published to Graphite under MediaWiki.timing.editResponseTime.

The metric is plotted in Grafana: Backend Save Timing Breakdown, and includes slices by account type (bot vs human), by entry point (index.php wikitext editor, vs api.php for VisualEditor and bots), and by page type or namespace (Wikipedia content, or Wikidata entity, or discussion pages).

Frontend Save Timing

Frontend Save Timing is measured as the time from pressing "Publish changes" in a user interface in a web browser (e.g. submitting the edit page form) until that browser receives the first byte of the server response that will render the confirmation page (e.g. the article with their edit applied and a "Post-edit" message).

This is implemented as navigationStart (the click to submit the form over HTTP POST) to responseStart (the first byte after the server has finished processing the edit, redirected, and responded to the subsequent GET).

Instrumented by Extension:NavigationTiming (client source code), processed by webperf-navtiming, and published to Statsd/Graphite under the mw.performance.save prefix.

The metric is plotted toward the bottom of Grafana: Save Timing, and includes slices by wiki (group1 is Wikidata/Commons, group2 is Wikipedia, per Train groups).

See also

When investigating Save Timing metrics, it may be useful to correlate with:

  • performance.wikimedia.org: Flame Graphs, which show where in the MW codebase time is spent during particular operations. Use the "api" graph for edits via the API, and the "fn-EditAction" graph for edits via index.php.
  • Grafana: Application Servers RED, which measures the overall load and throughput of our web servers by cluster (e.g. appservers for index.php, api_appservers for api.php) and by method (e.g. POST is mostly edits).
  • Grafana: Edit Count, which measures the overall rate of edits being saved. This count is derived directly from the Backend Save Timing metric, and thus corresponds fully. Any edit save that we measure for backend save timing is counted here, and vice-versa.

Synthetic metrics

The synthetic metrics are collected using VisualMetrics, which analyzes a video recording of the screen while the page is loading.

WebPageTest

The WebPageTest metrics are available in Graphite under the webpagetest prefix.

Visual Metrics

  • render: The time when something is painted within the viewport for the first time.
  • BackgroundImage: The time when the largest background image is painted at its final position within the viewport.
  • Heading: The time when the first h1/h2 heading is painted at its final position within the viewport.
  • LargestImage: The time when the largest image is painted at its final position within the viewport.
  • SpeedIndex: The Speed Index is the average time at which visible parts of the page are displayed. It is expressed in milliseconds and depends on the size of the viewport.
  • lastVisualChange: The time when the last paint happens within the viewport.
  • VisualComplete85: The time when 85% of the content within the viewport is painted.
  • VisualComplete95: The time when 95% of the content within the viewport is painted.
  • VisualComplete99: The time when 99% of the content within the viewport is painted.
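
To make the VisualComplete thresholds concrete, here is a minimal sketch. It assumes visual progress is available as a time-ordered list of (time in ms, percent complete) samples; the function name, the data shape and the sample values are assumptions for illustration, and the exact rule used by the real tooling may differ.

```python
def visual_complete(progress, threshold):
    """Return the first timestamp (ms) at which completeness reaches the threshold.

    `progress` is a list of (time_ms, percent_complete) tuples ordered by time.
    Returns None if the threshold is never reached.
    """
    for time_ms, percent in progress:
        if percent >= threshold:
            return time_ms
    return None


# Made-up visual progress for one page load.
progress = [(0, 0), (400, 40), (800, 90), (1200, 96), (1500, 100)]

if __name__ == "__main__":
    for threshold in (85, 95, 99):
        print(threshold, visual_complete(progress, threshold))
```

For this sample the three thresholds are reached at 800 ms, 1200 ms and 1500 ms respectively.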

Other timing metrics

  • fullyLoaded: The time when all requests on the page have finished loading.
  • TTFB: The time to first byte delivered from the server.
  • domComplete: domComplete from the Navigation Timing API.

Size and requests

WebPageTest also collects the number of bytes/requests per type. Add .bytes or .requests to the type to get that information.

  • html
  • css
  • js
  • flash
  • font
  • video
  • image
  • total: The total number of bytes/requests for the tested page.

CPU times

These metrics are Chrome only.

  • Idle: Time spent being idle.
  • Layout: Time spent rendering the screen.
  • Painting: Time spent painting the screen.
  • Scripting: Time spent running JavaScript.
  • Loading: Time spent loading assets.

WebPageReplay

The WebPageReplay metrics are available in graphite-synthetic-testing under the sitespeed_io prefix (the metrics are collected by Browsertime/sitespeed.io).

At the moment we collect all the default metrics that are collected when you run sitespeed.io. Here's a list of some of the most important ones:

Visual Metrics

  • FirstVisualChange: The time when something is painted within the viewport for the first time.
  • Heading: The time when the first h1 heading is painted at its final position within the viewport.
  • LargestImage: The time when the largest image is painted at its final position within the viewport.
  • VisualComplete85: The time when 85% of the content within the viewport is painted.
  • VisualComplete95: The time when 95% of the content within the viewport is painted.
  • VisualComplete99: The time when 99% of the content within the viewport is painted.
  • LastVisualChange: The time when the last paint happens within the viewport.
  • SpeedIndex: The Speed Index is the average time at which visible parts of the page are displayed. It is expressed in milliseconds and depends on the size of the viewport.
  • PerceptualSpeedIndex: The Perceptual Speed Index is sensitive to page elements moving around (unlike Speed Index). For example, if a campaign/banner pushes elements around, that will have a bigger impact on the perceptual metric.
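
Speed Index is commonly defined as the area above the visual progress curve, that is, the time-weighted amount of the viewport that is still incomplete. The sketch below follows that definition on a time-ordered list of (time in ms, percent complete) samples. The function name, the data shape and the sample values are assumptions for illustration, not the sitespeed.io implementation.

```python
def speed_index(progress):
    """Approximate Speed Index in ms from (time_ms, percent_complete) samples.

    For each interval between two samples, the page is treated as staying at
    the earlier sample's completeness, and the incomplete fraction is
    multiplied by the interval length.
    """
    total = 0
    for (t0, percent0), (t1, _percent1) in zip(progress, progress[1:]):
        total += (100 - percent0) * (t1 - t0)
    return total / 100


# Made-up visual progress for one page load.
progress = [(0, 0), (400, 40), (800, 90), (1200, 96), (1500, 100)]

if __name__ == "__main__":
    print(speed_index(progress))
```

A page that paints most of its content early gets a lower value than one that reaches the same final state at the same time but paints late, which is why the metric depends on the viewport contents and not only on the last visual change.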

CPU metrics

  • LongTasks.tasks: The number of CPU long tasks (> 50 ms).
  • LongTasks.beforeFirstPaint: The number of CPU long tasks (> 50 ms) happening before anything is painted on the screen.
  • LongTasks.totalDuration: The total time (in ms) spent in CPU long tasks.
  • LongTasks.beforeFirstPaint.totalDuration: The total time (in ms) spent in CPU long tasks before anything was painted on the screen.
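
The four values above can be derived from a list of CPU tasks roughly as follows. This sketch assumes each task is a (start in ms, duration in ms) tuple and that a task counts as "before first paint" when it starts before the first paint time. That attribution rule, the function name and the sample data are assumptions for illustration, not a description of how Browsertime does it.

```python
LONG_TASK_THRESHOLD_MS = 50


def long_task_metrics(tasks, first_paint_ms):
    """Summarize long tasks (duration > 50 ms) overall and before first paint."""
    long_tasks = [
        (start, duration)
        for start, duration in tasks
        if duration > LONG_TASK_THRESHOLD_MS
    ]
    # Assumption: a task belongs to "before first paint" if it starts before it.
    before_paint = [
        (start, duration)
        for start, duration in long_tasks
        if start < first_paint_ms
    ]
    return {
        "LongTasks.tasks": len(long_tasks),
        "LongTasks.beforeFirstPaint": len(before_paint),
        "LongTasks.totalDuration": sum(d for _, d in long_tasks),
        "LongTasks.beforeFirstPaint.totalDuration": sum(d for _, d in before_paint),
    }


# Made-up tasks as (start_ms, duration_ms); the first one is too short to count.
tasks = [(100, 30), (200, 80), (600, 120), (900, 51)]

if __name__ == "__main__":
    print(long_task_metrics(tasks, first_paint_ms=500))
```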

See also