Performance/Runbook/Webperf-processor services
This is the run book for deploying and monitoring webperf-processor services.
Hosts
The puppet role for these services is role::webperf::processors_and_site.
- webperf1001 (Eqiad cluster): Grafana host monitor.
- webperf2001 (Codfw cluster): Grafana host monitor.
- deployment-webperf11 (Beta cluster): grafana-labs machine stats.
navtiming
The navtiming service (written in Python) extracts information for the NavigationTiming and SaveTiming schemas from EventLogging via Kafka, and submits it to Graphite via Statsd. The EventLogging data comes from a JS plugin for MediaWiki (beacon js source, MediaWiki extension).
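The core transformation navtiming performs can be sketched as follows. This is an illustrative sketch only, not the actual navtiming code: the event shape and the metric names are assumptions loosely based on the EventLogging NavigationTiming schema.

```python
import json

def navtiming_to_statsd(raw_event):
    """Turn a raw EventLogging message (JSON from Kafka) into statsd
    timing lines. Illustrative sketch; field and metric names are
    assumptions, not the real navtiming mapping."""
    event = json.loads(raw_event)
    if event.get('schema') != 'NavigationTiming':
        return []
    lines = []
    for metric in ('responseStart', 'domComplete', 'loadEventEnd'):
        value = event.get('event', {}).get(metric)
        if value is not None:
            # statsd "timing" wire format: <name>:<ms>|ms
            lines.append('frontend.navtiming.%s:%d|ms' % (metric, value))
    return lines

# Example input resembling an EventLogging capsule:
raw = json.dumps({
    'schema': 'NavigationTiming',
    'event': {'responseStart': 120, 'domComplete': 950},
})
for line in navtiming_to_statsd(raw):
    print(line)
```

The real service consumes continuously from Kafka and sends the lines over UDP to Statsd; the sketch only shows the per-message mapping.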
Meta
- User documentation: Performance/Metrics.
- Source code: performance/navtiming.git
- Code review: Recent Gerrit activity
- Puppet class: webperf::navtiming
Application logs for this service are currently not sent to Logstash. To view them:
- SSH to the host you want to monitor.
- Run sudo journalctl -u navtiming -f -n100
This service runs on the webperf*1 hosts.
To update the service on the Beta Cluster:
- SSH to deployment-webperf11.deployment-prep.eqiad.wmflabs
- Run sudo journalctl -u navtiming -f -n100 and keep this open during the following steps.
- In a new tab, SSH to deployment-deploy01.deployment-prep.eqiad.wmflabs and run:
cd /srv/deployment/performance/navtiming
git pull
scap deploy
- Review the scap output (on the deploy host) and the journalctl output (on the webperf server) for any errors.
To deploy a change in production:
- Before you start, open a terminal window in which you monitor the service on a host in the current primary data center. For example, if Eqiad is primary, SSH to webperf1001.eqiad.wmnet and run sudo journalctl -u navtiming -f -n100.
- In another terminal window, SSH to deployment.eqiad.wmnet and navigate to /srv/deployment/performance/navtiming.
- Prepare the working copy:
- Ensure the working copy is clean: git status
- Fetch the latest changes from the Gerrit remote: git fetch origin
- Review the changes: git log -p HEAD..@{u}
- Apply the changes to the working copy: git rebase
- Deploy the changes; this will automatically restart the service afterward:
scap deploy
Restart navtiming
sudo systemctl restart navtiming
coal
Written in Python.
- User documentation: Performance/Metrics (explanation of the data processed by Coal).
- Source code: performance/coal.git (Gerrit).
- Deployed using Scap3.
- Puppet class: coal::processor.
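Coal aggregates NavigationTiming values into rolling per-window medians (see the user documentation above for the exact metrics it produces). The following is a minimal, hedged sketch of a sliding-window median; the window length and class name are illustrative assumptions, not coal's actual implementation.

```python
from collections import deque
from statistics import median

class WindowMedian:
    """Keep the last `size` samples of a metric and report their median.
    Illustrative sketch; coal's real window size and output path differ."""
    def __init__(self, size=5):
        # deque with maxlen discards the oldest sample automatically
        self.samples = deque(maxlen=size)

    def add(self, value):
        self.samples.append(value)

    def current(self):
        return median(self.samples)

w = WindowMedian(size=3)
for v in (100, 300, 200, 900):
    w.add(v)
print(w.current())  # median of the last 3 samples (300, 200, 900) -> 300
```

A median (rather than a mean) keeps one pathological page load from skewing the reported latency for the whole window.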
Application logs are kept locally, and can be read via sudo journalctl -u coal.
Reprocessing past periods
Coal data for an already processed period can be overwritten safely. To backfill a period after an outage, run coal manually on one of the perf hosts (no need to stop the existing process), using a different consumer group, and pass the --start-timestamp option (note that the timestamp must be expressed in milliseconds since the Unix epoch). Once you see that the outage gap has been filled, you can safely stop the manual coal process.
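Since --start-timestamp expects milliseconds since the Unix epoch (an easy unit to get wrong), a quick way to compute the value is:

```python
from datetime import datetime, timezone

def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to milliseconds since the
    Unix epoch, as expected by coal's --start-timestamp option."""
    return int(dt.timestamp() * 1000)

# e.g. backfill from 2020-01-15 06:00 UTC (example date, pick your own):
start = datetime(2020, 1, 15, 6, 0, tzinfo=timezone.utc)
print(to_epoch_ms(start))  # pass this value to --start-timestamp
```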
Restart coal
sudo systemctl restart coal
statsv
The statsv service (written in Python) forwards data from the Kafka stream for /beacon/statsv web requests to Statsd.
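The forwarding step can be sketched as below. This is an illustrative sketch, not the actual statsv code; in particular, the assumption that each query-string value carries its statsd type as a suffix (e.g. 1234ms for a timing, 1c for a counter) and the example metric names are mine, not taken from the source above.

```python
from urllib.parse import urlparse, parse_qsl

def statsv_to_statsd(url):
    """Translate a /beacon/statsv request URL into statsd lines.
    Sketch only: assumes values end in their statsd type suffix,
    e.g. 1234ms (timing), 1c (counter), 5g (gauge)."""
    query = urlparse(url).query
    lines = []
    for name, value in parse_qsl(query):
        for suffix in ('ms', 'c', 'g'):
            if value.endswith(suffix):
                number = value[:-len(suffix)]
                # statsd wire format: <name>:<value>|<type>
                lines.append('%s:%s|%s' % (name, number, suffix))
                break
    return lines

print(statsv_to_statsd('/beacon/statsv?MediaWiki.ping=1234ms&MediaWiki.edits=1c'))
```

In production the URLs arrive via the Kafka webrequest stream rather than directly, and the resulting lines are sent to Statsd over UDP.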
- Source code: analytics/statsv.git (Gerrit).
- Deployed using Scap3.
- Puppet class: webperf::statsv.
Application logs are kept locally, and can be read via sudo journalctl -u statsv.
Restart statsv
sudo systemctl restart statsv
coal-web
Written in Python.
- Source code: performance/coal.git (Gerrit).
- Puppet class: coal::web.
site
- Source code: performance/docroot.git (Gerrit).
- Puppet class: profile::webperf::site.
This powers the site at https://performance.wikimedia.org/. Beta Cluster instance at https://performance.wikimedia.beta.wmflabs.org/.