Performance/Runbook/Grafana best practices

This is a list of best practices for Grafana dashboards maintained by the Performance Team.

High-level checklist

  • Do all graphs have a left Y-axis with a useful and correct unit?
  • Can you tell what a graph represents exactly? (e.g. how it is aggregated) Is it obvious?

Dashboard layout

Write a dashboard legend. For examples, see ResourceLoader and Backend Pageview Time.

  • Create a "Text" panel, and leave it at the very top of the dashboard in its unnamed row. Title the panel "Legend".
  • Describe the subject of the dashboard in one sentence (e.g. what does the service do for end-users? what interaction does it instrument?)
  • Summarise in a sentence or two the flow of the data from the instrumentation source all the way to the Grafana screen (e.g. instrumented with a Statsd measure in milliseconds, or aggregated via mtail to Prometheus).
  • Consider linking to a documentation page about the service, or the launch task of the instrumentation.
  • Consider linking to the source code of the instrumentation.

Dashboard settings

General settings

  • Editable: Yes.
  • Preferred timezone: UTC.
  • Preferred range: Last N days for most dashboards. Last N hours for alert dashboards.
  • Auto refresh: Provide options for 5min and 15min. If on by default, use 5min as the default interval. Avoid shorter intervals, which put unnecessary load on the metrics database. If you need to be notified, consider using an alert instead.
  • Graph tooltip: Enable the shared crosshair.
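
As a rough illustration, the general settings above map onto a handful of top-level fields in a dashboard's JSON model. The sketch below builds them as a Python dictionary. The field names are assumed from the Grafana dashboard JSON model and may differ between Grafana versions, general_settings is a hypothetical helper, and the seven-day default is only a placeholder for "Last N days".

```python
def general_settings(days=7, auto_refresh=False):
    """Return dashboard-level settings matching the list above.

    `days` is a placeholder for "Last N days"; choose what suits the dashboard.
    Field names are assumed from the Grafana dashboard JSON model.
    """
    return {
        "editable": True,
        "timezone": "utc",
        "time": {"from": f"now-{days}d", "to": "now"},
        # Offer only 5min and 15min; shorter intervals add load on the metrics database.
        "timepicker": {"refresh_intervals": ["5m", "15m"]},
        # False leaves auto refresh off by default; otherwise default to 5min.
        "refresh": "5m" if auto_refresh else False,
        # Assumed encoding: 1 = shared crosshair (0 = default, 2 = shared tooltip).
        "graphTooltip": 1,
    }
```

Comparing the result against an exported dashboard's JSON is one way to spot settings that drifted from these recommendations.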

Annotations

Manual annotations

You can create annotations within Grafana for any moment or range of time, and associate each with one or more tags. On each dashboard, you decide which tags to query for shared annotations. For example, most Performance-team dashboards query "mediawiki", "performance", and "operations". This means that an annotation created by anyone, from any dashboard, with at least one of these tags is shown in the panels on that dashboard.

  • Edit the default "Annotations & Alerts" annotation.
  • Leave the default settings (Enabled: Yes, Hidden: Yes, Color: Blue / Cyan).
  • Filter by: Tags.
  • Match: "any".
  • Tags: (insert one or more globally shared tags).
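
The effect of Match: "any" can be sketched in a few lines of Python. This only models the matching rule described above and is not Grafana's implementation. The function name is_shown and the sample annotation tags are hypothetical.

```python
def is_shown(annotation_tags, dashboard_tags):
    """Return True if the annotation shares at least one tag with the dashboard."""
    return bool(set(annotation_tags) & set(dashboard_tags))


# Tags queried by most Performance-team dashboards.
dashboard_tags = ["mediawiki", "performance", "operations"]

# Hypothetical annotations created from other dashboards.
print(is_shown(["operations", "dns"], dashboard_tags))  # True: "operations" is shared
print(is_shown(["dns"], dashboard_tags))                # False: no tag in common
```

With Match: "any", one shared tag is enough, so broad tags make an annotation appear on many dashboards.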

MediaWiki deployments

If the service or instrumentation may be affected by MediaWiki deployments, enable one or both of the following annotations:

All MediaWiki deployments:

  • Name: MW deploy. Data source: graphite.
  • Enabled: No. Hidden: No. Color: Orange.
  • Query: exclude(aliasByNode(deploy.*.count,-2),"all")

Only full branch promotions part of the Train:

  • Name: Train deploy. Data source: graphite.
  • Enabled: Yes (this is the default state for the dashboard). Hidden: No (this means the control is shown and you can enable it ad-hoc when you need it).
  • Color: Orange.
  • Query: exclude(aliasByNode(deploy.sync-wikiversions.count,-2),"all")
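
To show what the two annotation queries do, here is a small Python model of the Graphite functions involved. It assumes that aliasByNode names each series after one dot-separated node of its path, and that exclude drops series whose name matches a regular expression. The series paths are hypothetical examples of what deploy.*.count might match.

```python
import re


def alias_by_node(path, node):
    """Name a series after one dot-separated node of its path (models aliasByNode)."""
    return path.split(".")[node]


def exclude(names, pattern):
    """Drop names that the regular expression matches anywhere (models exclude)."""
    regex = re.compile(pattern)
    return [name for name in names if not regex.search(name)]


# Hypothetical series matched by deploy.*.count
paths = ["deploy.all.count", "deploy.scap.count", "deploy.sync-file.count"]
aliases = [alias_by_node(path, -2) for path in paths]
print(exclude(aliases, "all"))  # ['scap', 'sync-file']
```

Node -2 is the deployment type, so each annotation is labelled with the kind of deployment, and the "all" roll-up series is dropped so events are not shown twice. Under this model the pattern is an unanchored regular expression, so it would also drop any name that merely contains "all".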

Graph panels

Keep your graph focussed

When creating a graph, keep in mind what question you want the graph to answer. If possible, focus on a single metric only.

More than three metrics is usually a sign that a graph is trying to answer too many questions at once. Such a graph may be unable to answer any of them accurately, for example because the axes have to span a wide range of values, or because the many colors, lines, and labels are hard to tell apart and match up.

One case where you do want many metrics in one graph is when you need to understand the relationship between quantities and their distribution. See "Graph with many metrics" below.

Draw mode

When plotting metrics that represent a quantity per interval, use a bar chart (e.g. rate counter, CPU usage percentage, bytes gauge for memory or disk).

For timing metrics, use a line chart.

Graph recommended settings

General:

  • ...

Metrics:

  • Remember to use .rate when querying Statsd counters from Graphite. Never use count or sum. (Why: Graphite#Extended properties.)
  • Preferred scale for counters is per second, and otherwise per minute.
  • For timing metrics, prefer plotting the max (Statsd: upper). Otherwise, consider p99 or p75. Avoid lower percentiles, medians, or mean averages. (Why: Measuring load times.)
  • Prefer minimal or no aggregations in queries. If an aggregation is applied, indicate this clearly in the legend. You can use the alias function to describe how the value is produced. For example, frontend.navtiming2.responseStart.mobile.p75 | movingAverage (24h) | alias("responseStart.mobile.p75 | movingAverage (24h)"). Notice how movingAverage is specified both as the actual query function and as text for the alias function.
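
When dashboards are generated or checked by a script, one way to keep the alias in sync with the query is to build both from the same inputs. The sketch below is a hypothetical helper (labelled_target is not part of Grafana or Graphite) that returns a Graphite target string using the alias and movingAverage functions from the example above.

```python
def labelled_target(metric, label, func=None, window=None):
    """Build a Graphite target whose alias spells out any aggregation applied.

    `label` is the short name shown in the legend. If `func` is given, it is
    applied with `window` and also appended to the legend text.
    """
    if func is None:
        return f'alias({metric}, "{label}")'
    return f'alias({func}({metric}, "{window}"), "{label} | {func} ({window})")'


print(labelled_target(
    "frontend.navtiming2.responseStart.mobile.p75",
    "responseStart.mobile.p75",
    func="movingAverage",
    window="24h",
))
```

Because the legend text is derived from the same arguments as the query, the two cannot drift apart when the window is changed later.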

Axes:

  • Always include a Left Y-axis on graph panels.
  • Unit: Set this correctly for timing metrics and percentages. For counters, we typically use the "short" notation.
  • Label: Use this to document the scale of counting metrics (e.g. "rate per minute"). The label is usually left blank for timing metrics.
  • Min/Max: Usually left to auto. For percentage graphs that can't exceed 100%, do set a max of 100% to avoid the automatic margin expansion to 120%.

Legend:

  • ..

Display:

  • Draw Mode: Bars or Lines.
  • Line width: 1. Line fill: 1.
  • Tooltip: All series. If the graph contains more than a dozen metrics, use Single instead.
  • Null value: null. (Setting this to Continuous or Zero almost always causes issues, eventually.)
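
The Axes and Display recommendations above lend themselves to an automated check. The sketch below inspects one panel from an exported dashboard. It assumes the legacy Graph panel JSON schema (fields yaxes, nullPointMode, linewidth, fill, bars, lines), which may not match newer panel types. check_graph_panel is a hypothetical helper.

```python
def check_graph_panel(panel):
    """Return a list of deviations from the recommended graph settings."""
    problems = []
    left = (panel.get("yaxes") or [{}])[0]
    if not left.get("show", False):
        problems.append("left Y-axis is hidden")
    if not left.get("format"):
        problems.append("left Y-axis has no unit")
    if panel.get("nullPointMode") != "null":
        problems.append("null value should be 'null'")
    if panel.get("linewidth") != 1 or panel.get("fill") != 1:
        problems.append("line width and line fill should both be 1")
    # Exactly one of bars or lines should be enabled.
    if bool(panel.get("bars")) == bool(panel.get("lines")):
        problems.append("draw mode should be either bars or lines")
    return problems


# Hypothetical timing panel that follows the recommendations.
panel = {
    "yaxes": [{"show": True, "format": "ms"}],
    "nullPointMode": "null",
    "linewidth": 1,
    "fill": 1,
    "bars": False,
    "lines": True,
}
print(check_graph_panel(panel))  # []
```

A check like this only covers what can be read from the JSON. Whether the unit is actually correct for the metric still needs a human look.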

Graph with alert rules

  • .. (TODO: Info thing.)
  • .. (TODO: threshold thing.)

Graph with many metrics

When plotting more than a dozen metrics with the intent to understand distribution, it is recommended to create a stacked bar chart (not a line graph). Like so:

  • Display: Set Draw mode to Bars, and enable Stacking mode. Ensure the hover value is stacked "individually".
  • Legend: Hide the legend (it's too crowded). Alternatively, show it as a scrollable table to the right.
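
As a sketch, the two settings above could be expressed as overrides on a panel's JSON. The field names (bars, lines, stack, tooltip.value_type, legend.show, legend.alignAsTable, legend.rightSide) are assumed from the legacy Graph panel schema, and stacked_bar_overrides is a hypothetical helper.

```python
def stacked_bar_overrides(legend_table=False):
    """Return panel fields for a stacked bar chart of many metrics."""
    if legend_table:
        # Alternative: show the legend as a table to the right of the graph.
        legend = {"show": True, "alignAsTable": True, "rightSide": True}
    else:
        # Default: hide the legend, since it is too crowded to be useful.
        legend = {"show": False}
    return {
        "bars": True,
        "lines": False,
        "stack": True,
        # Hover shows each series' own value rather than the running total.
        "tooltip": {"value_type": "individual"},
        "legend": legend,
    }


print(stacked_bar_overrides()["legend"])  # {'show': False}
```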

Alert rules

  • Evaluate every: 15 min.
  • Query condition: Range for last 15min or 1h, until now-5min.
  • If no data or all nulls: Alerting. (This helps detect when the underlying service may be down or broken. We used to ignore this due to a bug in Graphite, but as of January 2019 we're trying it again.)
  • If error or timeout: Keep Last State. (Graphite often times out; when using Prometheus consider Alerting on errors.)
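
For reference, the alert settings above might look as follows in a panel's alert definition. This is a sketch that assumes the legacy Grafana alerting JSON (fields frequency, conditions, noDataState, executionErrorState). The alert_rule helper and the query ref "A" are hypothetical, and the reducer and threshold evaluator are left out because they depend on the metric.

```python
def alert_rule(window="15m", datasource="graphite"):
    """Return alert fields matching the recommended evaluation settings."""
    return {
        "frequency": "15m",  # Evaluate every: 15 min
        "conditions": [
            {
                "type": "query",
                # Query "A" over the last `window`, ending five minutes ago.
                "query": {"params": ["A", window, "now-5m"]},
            }
        ],
        # No data or all nulls may mean the underlying service is down.
        "noDataState": "alerting",
        # Graphite often times out, so keep the last state there;
        # with Prometheus, consider alerting on errors instead.
        "executionErrorState": (
            "alerting" if datasource == "prometheus" else "keep_state"
        ),
    }


print(alert_rule()["executionErrorState"])  # keep_state
```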

See also