grafana.wikimedia.org
grafana.wikimedia.org is a frontend for creating queries and storing dashboards using data from Graphite and other datasources.
Service
Currently hosted on and served from grafana1002.
Editing dashboards
To edit dashboards, you need to be a member of the cn=nda or cn=wmf LDAP groups. https://grafana.wikimedia.org is read-only; to edit dashboards (or change administrative settings) you need to access the separate vhost https://grafana-rw.wikimedia.org. Hitting the "login" link at the bottom of the left sidebar will also redirect you as needed. The Grafana web interface is integrated into our web SSO identity provider, which is based on Apereo CAS.
Private dashboards
A folder for private dashboards (of the same name) is also available. Dashboards created in (or moved to) this folder can only be viewed after logging into Grafana. Please use this feature sparingly, and default to public dashboards unless absolutely needed (see e.g. T267930 for one such case).
Viewing
Most viewing features can be discovered naturally, but here are a few you may not immediately realise exist:
- Dynamic time range: you can zoom in and focus on any portion of a plot by selecting and dragging within the graph.
- Metrics: you can click on metric names in the graph legend to isolate a single metric, or ctrl/cmd click to exclude a metric.
- Annotations: these, such as the ones for deployments, can be toggled by clicking the lightning icon on the top left.
Save dashboards in puppet
For critical dashboards it is important to have revision control. To this end, it is possible to save dashboards in puppet and have them effectively read-only in Grafana.
NB the dashboard url will change suffix, from /db/<dashboard> to /file/<dashboard>.json
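As a rough illustration of the suffix change described above, the following Python sketch rewrites a dashboard URL from the /db/ form to the /file/ form. The function name puppet_dashboard_url is hypothetical, and the sketch assumes a URL that ends in /db/<dashboard> with no query string.

```python
def puppet_dashboard_url(url: str) -> str:
    """Rewrite .../db/<dashboard> to .../file/<dashboard>.json."""
    # Split on the last "/db/" so that everything before it is kept as-is.
    prefix, sep, name = url.rpartition("/db/")
    if not sep or not name or "/" in name:
        raise ValueError(f"not a /db/<dashboard> URL: {url!r}")
    return f"{prefix}/file/{name}.json"
```

Bookmarks and links that point at the old /db/ URL need the same rewrite once the dashboard is served from puppet.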
Import a new dashboard
- Clone operations/puppet repository, see operations/puppet.git
- cd puppet/modules/grafana/files/dashboards
- Import a dashboard with ../grafana-dashboard DASHBOARD_URL (requires python and requests); the filename will be the same as the dashboard's name.
  - The dashboard will get tagged with source:puppet.git and readonly if it doesn't carry the tags already.
- Add a grafana::dashboard resource to role::grafana, e.g. https://gerrit.wikimedia.org/r/#/c/268085, and commit in git
- Send and schedule the code review for next Puppet request window
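The tagging step in the import procedure above can be pictured with a short Python sketch. This is not the grafana-dashboard script itself; ensure_tags is a hypothetical helper that only assumes the dashboard JSON keeps its tags in a top-level "tags" list.

```python
REQUIRED_TAGS = ("source:puppet.git", "readonly")


def ensure_tags(dashboard: dict) -> dict:
    """Return a shallow copy of the dashboard with both puppet tags present."""
    tagged = dict(dashboard)
    tags = list(tagged.get("tags") or [])
    for tag in REQUIRED_TAGS:
        # Only add a tag when the dashboard does not carry it already.
        if tag not in tags:
            tags.append(tag)
    tagged["tags"] = tags
    return tagged
```

Existing tags keep their order, so importing an already tagged dashboard again produces no diff in the tag list.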
Update an existing dashboard
- Save the readonly dashboard under another name (e.g. NAME-DASHBOARD) and make the desired changes
- Import in puppet at files/grafana/dashboards with ../grafana-dashboard NEW_DASHBOARD_URL like above
- The new dashboard will get saved under a different name, so rename the file to the desired name and commit
- Send and schedule the code review for next Puppet request window
- (Optional) Delete the modified dashboard in grafana
Import a dashboard over an already imported dashboard (rename)
- Make a note of the target uid and title fields
- Use grafana-dashboard.py in the usual way, but with --title --uid and --filename flags populated with the values from the target dashboard.
- Send and schedule the code review for next Puppet request window
- (Optional) Delete the modified dashboard in grafana
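To make the role of the target uid and title in the rename procedure above concrete, here is a minimal Python sketch of what carrying them over amounts to at the JSON level. It is an assumption about the effect, not a description of how grafana-dashboard.py is implemented; retarget is a hypothetical name, and the sketch relies only on the top-level "title" and "uid" fields of a dashboard JSON document.

```python
import copy


def retarget(dashboard: dict, title: str, uid: str) -> dict:
    """Return a copy of `dashboard` that carries the target's title and uid."""
    result = copy.deepcopy(dashboard)
    # Reusing the target's identifiers makes the import replace the target
    # dashboard instead of creating a second one next to it.
    result["title"] = title
    result["uid"] = uid
    return result
```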
Grizzly
Grizzly is a utility for managing various observability resources with Jsonnet. We are currently piloting this to manage our grafana dashboards as code.
Status
Currently grizzly is a pre-production service. Please reach out to the observability team if you’d like to become an early adopter by migrating your favorite dashboards to grizzly.
Workflow
Changing a dashboard
Preparing the Change
- First, clone the operations/grafana-grizzly git repository https://gerrit.wikimedia.org/r/admin/repos/operations/grafana-grizzly
- Upload a patch with your changes. See the Varnish SLO Dashboard change as an example. If you wish to experiment/test the changes there is a test environment in pontoon accessible at http://o11y-grafana.wmcloud.org
- Patch review, feel free to tag anyone in sre-observability for a review.
Deploying the change
After the patch has been reviewed and merged (currently requiring a manual V+2), the working repository on the grafana hosts (/srv/grafana-grizzly) will be updated on the next puppet run.
grafana1002:~$ sudo puppet agent -t
grafana1002:~$ cd /srv/grafana-grizzly
grafana1002:/srv/grafana-grizzly$ grr diff example.jsonnet
# Manually review the diff, make sure it looks good to you
#
# Note: grizzly will output "Dashboard/dashboard_name not present in Dashboard" if the Dashboard does not yet exist in grafana, and will not show a diff. In this case you can use 'grr preview' to generate a snapshot of the dashboard in Grafana for review.
grafana1002:/srv/grafana-grizzly$ grr preview example.jsonnet
# When ready to deploy:
grafana1002:/srv/grafana-grizzly$ grr apply example.jsonnet
Importing a dashboard
Grizzly supports importing dashboards from grafana, but a few steps may be necessary to fully adapt the json. Typically this involves adding or adjusting the id of the dashboard to something grizzly will recognize.
There are a couple ways to go about this. You can fetch the json through the grafana UI, or attempt to pull it using grizzly itself.
In order to pull using grizzly, you will need the uid of the dashboard. One way to find this is in the dashboard properties within the grafana web interface.
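Another place the uid usually shows up is the dashboard URL, which in recent Grafana versions has the form /d/<uid>/<slug>. The Python sketch below extracts it under that assumption; uid_from_url is a hypothetical helper, and the slug in the usage example is made up.

```python
from urllib.parse import urlparse


def uid_from_url(url: str) -> str:
    """Extract <uid> from a URL whose path contains /d/<uid>/..."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    # The uid is the path segment right after the literal "d" segment.
    if "d" not in parts[:-1]:
        raise ValueError(f"no /d/<uid> segment in {url!r}")
    return parts[parts.index("d") + 1]
```

For example, uid_from_url("https://grafana.wikimedia.org/d/slo-etcd/etcd-slo?orgId=1") returns "slo-etcd".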
Usage Examples
grafana1002:/srv/grafana-grizzly$ grr list slo_dashboards.jsonnet
API VERSION KIND UID
grizzly.grafana.com/v1alpha1 Dashboard slo-etcd
grizzly.grafana.com/v1alpha1 Dashboard slo-logstash
grafana1002:/srv/grafana-grizzly# grr diff slo_dashboards.jsonnet
Dashboard/slo-etcd no differences
Dashboard/slo-logstash no differences
Style Guide
Conventions
- Grizzly dashboards should be tagged as ‘grizzly’
- Grizzly managed dashboards should be placed into a directory within grafana that contains dashboards managed only by grizzly.
- Dashboards of similar types should be grouped into a single jsonnet file, which can include additional dashboard json files as required.
Notes
The grr command itself is configured via environment variables containing attributes like the grafana server url, api key, etc. A wrapper has been deployed as /usr/local/bin/grr to supply these values from /etc/grafana/grizzly.env, a file readable by group ops.
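How such a wrapper might load its settings can be sketched in Python as follows. The real wrapper and the exact layout of /etc/grafana/grizzly.env are not shown on this page, so the KEY=VALUE format is an assumption, parse_env_file is a hypothetical helper, and the variable names in the usage example are placeholders.

```python
def parse_env_file(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and '#' comments."""
    env = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        # Tolerate shell-style "export KEY=VALUE" lines.
        if line.startswith("export "):
            line = line[len("export "):].lstrip()
        key, sep, value = line.partition("=")
        if not sep:
            continue  # not an assignment, ignore it
        # Drop surrounding double or single quotes from the value.
        env[key.strip()] = value.strip().strip('"').strip("'")
    return env
```

The resulting dictionary could then be merged over os.environ and passed as the env argument of subprocess.run when invoking the underlying binary.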
Features
To enable the shared crosshair (which draws the current target cursor in all graphs on the page), go to "Configure dashboard" (top right menu). Then tick the "Shared Crosshair" setting in the Features section.
By default each data point of each metric has its own tooltip, only shown when hovering the exact point. Consider enabling the "All series" tooltip. This will ensure the tooltip is always shown when inside the graph. All points on the vertical axis are shown in a single tooltip at the same time. Horizontally, the closest data point will be shown in the tooltip.
- Click on the graph title and select Edit.
- In the Display Styles section, enable tooltip "All series".
- From the top navigation, go back to the dashboard.
Show deployments
Add MediaWiki deployment events as annotations to your dashboard:
- Enable the "Annotations" feature from the "Configure dashboard" panel (top right menu).
- From the new settings menu on the top left, choose Annotations.
- Add a new annotation.
- Name: MW deploy
- Data source: graphite
- Color: Light grey
- Graphite target expression:
exclude(aliasByNode(deploy.*.count,-2),"all")
- Click Update, and close the settings menu.
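Before adding the annotation, it can help to check what the target expression returns. The Python sketch below builds a URL for Graphite's render API (the /render endpoint with target, from and format parameters) using that expression. The base URL passed in is an assumption to replace with the actual Graphite host, and render_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

# The annotation target expression from the steps above.
DEPLOY_TARGET = 'exclude(aliasByNode(deploy.*.count,-2),"all")'


def render_url(base_url: str, target: str, since: str = "-1h") -> str:
    """Build a Graphite render API URL that returns the target as JSON."""
    query = urlencode({"target": target, "from": since, "format": "json"})
    return f"{base_url.rstrip('/')}/render?{query}"
```

Opening render_url("https://graphite.example.org", DEPLOY_TARGET) in a browser, with the placeholder host swapped for the real one, shows the series the annotation would be drawn from.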
Input variables
See Grafana Templated dashboards.
Time correction
From "Configure dashboard" (top right menu) one can change the default ("browser time") to use "UTC" instead.
Negative axis
If you're plotting metrics with the intention to show some of them as negative, apply "Transform: negative-Y" from "Display > Series overrides".[1] This will visually flip the values in the graph (as negative), whilst preserving the positive values for legends and crosshairs. This is preferred over modifications like scale(-1), which will affect other displays of the metric as well and can cause confusion.
Example: Server board - Network traffic (plots upload and download bandwidth)
Alerts (with notifications via Icinga)
Alerts can be set up through Grafana on each panel. For example, this panel has an alert set when the Varnish cache hit ratio for ResourceLoader requests drops below a certain percentage.
For most alerts that query data from Graphite, it makes sense to use "Keep Last State" for error conditions and missing data. This is because it is not unusual for Graphite to fail to respond to a request intermittently, and because data for one of the minutes can be missing in certain race conditions.
In order to receive email notifications about Grafana alerts, you need to connect an Icinga contact group to a given dashboard by making some changes in Puppet configuration. All alerts from a given dashboard will be sent to the same "contact_group". The "Notifications" tab in the Grafana interface is not used (background at T153167).
Example of lines to add to a file in puppet.git:/modules/icinga/manifests/monitor/
class icinga::monitor::example {
    monitoring::grafana_alert { 'db/resourceloader-alerts':
        contact_group => 'team-performance',
    }
}
If not specified, contact_group defaults to "admin", which is IRC only. The full list is available in puppet.git:modules/nagios_common/files/contactgroups.cfg.
See also:
- T152473 and T153167: Original Grafana alerting work by the performance team.
- Example puppet change to add Icinga alerting for specific dashboards.
Annotations based on Prometheus data
Grafana Annotations allow marking specific points on the graph with a vertical line and an associated description. The information about when a given event has occurred can be extracted with a Prometheus query.
To add a Prometheus-based Annotation:
- Choose "Annotations" from the settings button (gear icon)
- Click on "New"
- Choose a name for the new annotation and a Prometheus data source
- Insert a Prometheus query returning 1 when the annotation should be displayed. For example, in case of a metric tracking uptime in seconds, you can add an annotation to show when the service is started by using the resets() function. For example: resets(service_uptime{site=~"$site"}[5m]) > bool 0
- Add the label that will be displayed when moving the cursor over the annotation (triangle at the bottom of the vertical line). To do that, fill the "Field formats" section of the form and specify some constant text under "Title" and a comma separated list of "Tags" which must be Prometheus labels returned by the query (e.g. instance, job).
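To see why the resets() query in the steps above marks service starts, it helps to recall that Prometheus counts a reset whenever a value decreases between two consecutive samples. The Python sketch below mimics that logic on a plain list of samples; both function names are hypothetical and the code is a simplified model, not Prometheus itself.

```python
def count_resets(samples):
    """Count how often a series decreases between consecutive samples."""
    return sum(1 for prev, cur in zip(samples, samples[1:]) if cur < prev)


def annotation_value(samples):
    """Mimic `resets(...) > bool 0`: 1 if any reset happened, else 0."""
    return 1 if count_resets(samples) > 0 else 0
```

For an uptime series such as [10, 70, 130, 5, 65], the drop from 130 to 5 is the restart, so the annotation query yields 1 for a range covering it and 0 otherwise.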
Best practices
If you are creating or updating a dashboard, see also Performance/Runbook/Grafana best practices for a list of best practices.
Common pitfalls
Alerts fire but the threshold was not reached
Reported upstream at https://github.com/grafana/grafana/issues/12134.
It is common for a newly configured alert to fire within days for a value that, later, cannot be found in the graph. This is likely because the alert query has a "to" time of "now". This is a problem especially with data queried from Graphite, where the data for the current minute may be null or otherwise incomplete. Resolve this by always giving the alert query a "to" time of at least 1 minute in the past. For example, from 1h, to now-1m.
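The effect of the "to" time can be illustrated with a simplified Python model. This is not Grafana's alert evaluator; values_in_window is a hypothetical helper, timestamps are plain seconds, and the sample data is invented to show a partially filled current minute.

```python
# Per-minute data points as (timestamp_seconds, value). The last point
# belongs to the current minute and has only been partially aggregated.
POINTS = [(0, 100), (60, 102), (120, 98), (180, 3)]


def values_in_window(points, now, frm, to):
    """Return the values whose timestamp lies in [now - frm, now - to]."""
    start, end = now - frm, now - to
    return [value for ts, value in points if start <= ts <= end]
```

With now=180 and frm=180, a window ending at to=0 includes the incomplete value 3 and could trip a "below threshold" alert, whereas to=60 evaluates only the three complete points.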
Alerts for values derived from multiple metrics fire unexpectedly
When writing an alert for a value that is derived from multiple metrics (e.g. "cache_miss.rate" and "cache_hit.rate"), be sure to have the alert query until now-5m instead of until now, because the last few data points may not be complete, especially if they come from different servers. When evaluating math in Graphite in a way that involves a single metric, null remains null. But when involving multiple metrics, null is treated as zero. This can cause percentage values derived from two or more metrics to temporarily become a nonsensical value that can trigger your alert. Frustratingly, there is a good chance that by the time you look at the alert dashboard, the value will be complete, and no amount of zooming into the time frame where the alert occurred will reveal the bad value.
For more info, see also Graphite gotcha: Null in math (Grafana blog).
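The null-as-zero behaviour can be reproduced in a few lines of Python. The sketch below is a simplified model of combining two metrics, not Graphite's implementation; hit_ratio is a hypothetical helper and None stands in for a null data point.

```python
def hit_ratio(hit, miss):
    """Percentage of hits, counting a missing (None) operand as zero."""
    if hit is None and miss is None:
        return None  # nothing reported at all, so the result stays null
    total = (hit or 0) + (miss or 0)
    if total == 0:
        return None  # avoid dividing by zero
    return 100.0 * (hit or 0) / total
```

With complete data, hit_ratio(90, 10) gives 90.0. If the hit metric has not arrived yet, hit_ratio(None, 10) gives 0.0, which is exactly the kind of transient value that trips an alert and later cannot be found in the graph.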
Recover after making the dashboard not editable
Delete the dashboard via the API, then restore it without the toggle.
Known issues
Alerts with asPercent() not working
movingAverage template
When using a template inside movingAverage, the default mode wrongly expands the variable (it adds quotes around it, instead of leaving it as a number). These have to be removed manually by editing the metric directly (click on the pencil). Whenever the metric is changed, it has to be fixed again.
Color inspector broken
After each time you change a color through a color picker you must click on empty space anywhere outside the color picker. Otherwise the value will not be saved. (The little square will reflect your chosen color, but once applied, it will be lost). If you try to click "Invert", "Save", "Update" or one of the other color squares directly, the change will be lost.
Beta cluster
To use Grafana in the beta cluster, use https://grafana-labs.wikimedia.org for viewing / https://grafana-labs-admin.wikimedia.org/ for editing. If you copy a dashboard from production, you need to change the data source to Labs Graphite and replace the top-level MediaWiki with BetaMediaWiki. (Or use Prod Graphite to test a dashboard with production data.)
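The two changes needed when copying a production dashboard can be sketched in Python as follows. The sketch assumes a dashboard JSON with a flat top-level "panels" list, panel-level "datasource" names and Graphite queries stored under targets[].target; older dashboards that nest panels under rows would need an extra loop. to_beta is a hypothetical helper.

```python
import copy
import re

# Match "MediaWiki." only at the start of a metric path, i.e. not when it is
# preceded by a word character or a dot (so "BetaMediaWiki." is left alone).
_TOP_LEVEL = re.compile(r"(?<![\w.])MediaWiki\.")


def to_beta(dashboard: dict, datasource: str = "Labs Graphite") -> dict:
    """Return a copy of the dashboard adapted for the beta cluster."""
    result = copy.deepcopy(dashboard)
    for panel in result.get("panels", []):
        if "datasource" in panel:
            panel["datasource"] = datasource
        for target in panel.get("targets", []):
            if isinstance(target.get("target"), str):
                target["target"] = _TOP_LEVEL.sub(
                    "BetaMediaWiki.", target["target"]
                )
    return result
```

Because already converted paths are not matched again, running the function twice gives the same result as running it once.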
Pipeline
The Deployment Pipeline is well supported in Grafana. All services that are deployed in it benefit from ready-made dashboards that have basic functionality and structure already set. Most of the dashboards follow the RED/4 golden signals approach, by providing Traffic (aka Rate), Errors, Latency (aka Duration) and Saturation rows and panels in a dashboard named after the service. The hierarchy for the pipeline is under the Service folder.
While we have started experimenting with Grafana Grizzly to maintain this hierarchy, for now the process of instantiating and maintaining a new dashboard is manual. It consists of copying the Template Dashboard from the Service folder, changing the service variable to the name of the service (specifically the k8s namespace) and saving.
Usage for product analytics purposes
Grafana/Graphite is very useful for incident monitoring, but is less suitable for systematically analyzing data for product decisions. For example:
- It does not allow easy comparison of data along dimensions like browser family or project domain, like Superset and Turnilo do. (This can be a limitation for technical investigations too, see e.g. phab:T166414 or [1].)
- The underlying data in Graphite can’t be queried easily like we can do for EventLogging data. This makes it more difficult to vet and debug an instrumentation, and to answer more involved data questions (that go beyond time series data).
- Also, Graphite compresses data after some time, making it hard to use it for investigating/comparing historical data.
That said, every EventLogging schema has an associated Grafana board (always linked from the schema talk page - example) which is valuable for monitoring its overall event rate.
Operations
Version upgrade
This section details how to roll out a Grafana version upgrade.
- Download the latest Debian package from https://grafana.com/grafana/download
- Copy the package and install it on the host acting as backend for https://grafana-next.wikimedia.org. The mapping is in puppet at hieradata/common/profile/trafficserver/backend.yaml.
- Verify basic functionality (login, view, edit)
- Update the APT repository on the host serving apt.wikimedia.org as follows:
root@apt1001:~# reprepro --noskipold --restrict grafana checkupdate buster-wikimedia
root@apt1001:~# reprepro --noskipold --restrict grafana update buster-wikimedia
- Back up the database on the main Grafana host prior to the upgrade.
cp /var/lib/grafana/grafana.db /var/lib/grafana/grafana.db-$(date -I)
- Upgrade the package
apt -q update
apt install grafana
- Roll out the upgrade to cloudmetrics hosts. The full list of grafana hosts is obtained with cumin C:grafana
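The database backup step above can also be scripted. The following Python sketch mirrors the cp command with its $(date -I) suffix; backup_path and backup_database are hypothetical helper names, and the only path used is the one given in the steps.

```python
import shutil
from datetime import date


def backup_path(db_path, today=None):
    """Return '<db_path>-YYYY-MM-DD', the same suffix `date -I` produces."""
    today = today or date.today()
    return f"{db_path}-{today.isoformat()}"


def backup_database(db_path="/var/lib/grafana/grafana.db", today=None):
    """Copy the database next to itself and return the backup's path."""
    target = backup_path(db_path, today)
    # copy2 also preserves file timestamps, which plain `cp` does not.
    shutil.copy2(db_path, target)
    return target
```

Running backup_database() before the package upgrade leaves a dated copy that can be moved back into place if the new version misbehaves.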
Notes
- [1] Negative y-transform, What's new in Grafana v2.1
External links
- https://grafana.wikimedia.org
- http://grafana.org/
- http://docs.grafana.org/guides/whats-new-in-v2/
- http://docs.grafana.org/guides/whats-new-in-v2-1/
Subpages
- Home
- NavTiming
- Template
- echoflyout
- echoflyout.json
- eventlogging-schema.json
- eventloggingschema
- home.json
- labs-project-board.json
- media.json
- mw-js-deprecate.json
- mwjsdeprecate
- mwjsdeprecate.json
- navigation-timing
- navigation-timing.json
- ores.json
- performance-metrics-copy.json
- performance-metrics.json
- resourceloader
- resourceloader.json
- save-timing
- save-timing.json
- server-board.json
- template-dashboard.json
- varnish-http-errors
- varnish-http-errors.json
- webpagetest.json