
Performance/Runbook/SyntheticToolAlert

Revision as of 10:21, 28 October 2021 by imported>Phedenskog (Added info about ttfb)

This is the runbook for the synthetic tool alerts, which fire when one of the tools is down or when we are missing data from that tool.

Meta

WebPageTest missing data

The WebPageTest alert fires if we are missing data from WebPageTest.

  1. Check if the WebPageTest job runner is running.
    1. Log in to the instance: ssh wpt-runner.webperf.eqiad.wmflabs
    2. Check if the job runner is stuck. Run docker ps and check how long the container has been running. If it has been running for more than an hour, it is stuck. Kill the container with docker kill <container name> and the runner will recover.
    3. Check the log file /tmp/sitespeed.io.log. Do you see any errors? Are the tests running? Look for entries like [2020-10-07 17:17:38] INFO: Sending url https://en.wikipedia.org/speed-tests/Banksy.enwiki.872156204/ to test on wpt.wmftest.org. If you see that, you know the runner is working.
    4. If you see errors like [2020-10-02 00:00:16] ERROR: Could not run test for WebPageTest {"name":"WPTAPIError","code":500,"message":"Internal Server Error"}, you know something is wrong on the WebPageTest server or agent.
  2. Check if the WebPageTest server is working.
    1. Access the test log page in your browser: http://wpt.wmftest.org/testlog.php?days=1&filter=&all=on
    2. Do you see any tests? Click on them and check that they look OK.
    3. If you cannot access the tests, log in to the WebPageTest server: ssh -i webpagetest.pem ubuntu@wpt.wmftest.org
    4. Dig into the different log files and check what you can see: Performance/WebPageTest#Logs
  3. Check if the WebPageTest agent is running: ssh -i WebPageTestAgent.pem ubuntu@3.95.23.228
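
The log check in step 1 can be sketched as a small shell function. This is only an illustration: the function name summarise_log is made up here, and the two patterns are taken from the example log entries quoted above, so adjust them if the real log looks different.

```shell
# Summarise a sitespeed.io log: how many URLs were sent for testing and how
# many errors were logged. summarise_log is a hypothetical helper name; the
# default path and the two patterns come from the runbook steps above.
summarise_log() {
  local log_file="${1:-/tmp/sitespeed.io.log}"
  local sent errors
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  sent=$(grep -c 'INFO: Sending url' "$log_file" || true)
  errors=$(grep -c 'ERROR:' "$log_file" || true)
  echo "sent=${sent} errors=${errors}"
}
```

Run summarise_log on the runner after logging in. A non-zero sent count suggests the runner is working, while a growing errors count points at the WebPageTest server or agent.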

WebPageReplay missing data

  1. Check if the WebPageReplay server is running by logging in to the machine.
  2. Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  3. If the log file seems stuck (nothing has been written for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the created date is more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check that a new container is up and running with docker ps, and then verify that everything looks OK in the log.
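
The "created more than one hour ago" check can be sketched in shell as follows. The helper names is_older_than and list_stuck_containers are made up for this example, and the wrapper assumes GNU date (for date -d) and a Docker CLI that can be run without sudo; neither is confirmed by this runbook.

```shell
# Return success if the age (now - started) exceeds max_age seconds.
# The default of 3600 seconds mirrors the one-hour rule in the steps above.
is_older_than() {
  local started_epoch="$1" now_epoch="$2" max_age="${3:-3600}"
  [ $(( now_epoch - started_epoch )) -gt "$max_age" ]
}

# Print the names of running containers created more than an hour ago.
# Assumption: GNU date can parse the timestamp printed by docker inspect.
list_stuck_containers() {
  local now name created
  now=$(date +%s)
  docker ps --format '{{.Names}}' | while read -r name; do
    created=$(docker inspect --format '{{.Created}}' "$name" </dev/null)
    if is_older_than "$(date -d "$created" +%s)" "$now"; then
      echo "$name"
    fi
  done
}
```

Any name printed by list_stuck_containers is a candidate for docker kill <container name>. The sketch deliberately only lists containers and leaves the kill as a manual step.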

WebPageReplay CPU benchmark alert

The CPU benchmark measures how stable the metrics and the CPU are on the machine that runs the tests. If it stays unstable for some time, you need to deploy the tests on a new server.

WebPageReplay TTFB alert

The TTFB (time to first byte) should be very stable when you use WebPageReplay, since the tests run on the same server as the browser. If you see high variation in TTFB, something is seriously wrong on the server and you should try deploying the tests on another server.
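
To put a number on "high variation", one option is to compute the mean and standard deviation of a handful of TTFB samples. The function below is a minimal sketch: ttfb_spread is a made-up name, it expects one TTFB value in milliseconds per line on standard input, and how you export those values from the dashboard is not covered here. The runbook does not state the alert threshold, so none is assumed.

```shell
# Read one TTFB sample (ms) per line from stdin and print count, mean and
# population standard deviation. ttfb_spread is a hypothetical helper name.
ttfb_spread() {
  awk '
    # Skip blank lines; accumulate count, sum and sum of squares.
    NF { n++; sum += $1; sumsq += $1 * $1 }
    END {
      if (n == 0) { print "n=0"; exit }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against floating-point rounding
      printf "n=%d mean=%.1f stddev=%.1f\n", n, mean, sqrt(var)
    }
  '
}
```

For example, printf '90\n110\n' | ttfb_spread prints n=2 mean=100.0 stddev=10.0. A standard deviation that is large relative to the mean would match the "high variation" case described above.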

CRUX missing data

The CrUX (Chrome User Experience Report) data is collected once a day by a job that runs from the crontab.

  1. Log in to the server that collects the data from the Chrome User Experience Report API: ssh gpsi.webperf.eqiad.wmflabs
  2. Check the log located at /tmp/sitespeed.io.log to see if there are any errors.
  3. If you find out what's wrong and fix it, you can manually run the four tests that collect the data (look in the crontab for how to do that). It doesn't matter when you run the tests, as long as you run them once per day (otherwise we will miss out on data).
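
Because the job only runs once a day, it can help to look at the errors for one specific day. The sketch below assumes that log lines start with a timestamp such as [2020-10-07 17:17:38], as in the WebPageTest log examples on this page; whether the CrUX log uses the same format is an assumption. The name errors_on_day is made up for this example.

```shell
# Print the ERROR lines logged on a given day (YYYY-MM-DD).
# Assumption: each log line starts with "[YYYY-MM-DD HH:MM:SS]".
errors_on_day() {
  local day="$1" log_file="${2:-/tmp/sitespeed.io.log}"
  awk -v day="$day" 'index($0, "[" day " ") == 1 && /ERROR:/' "$log_file"
}
```

To see the errors logged today, run errors_on_day "$(date +%F)". The crontab itself can be listed with crontab -l when run as the user that owns the job.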

sitespeed.io missing data

  • Check if the sitespeed.io server is running by logging in to the machine.
  • Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  • If the log file seems stuck (nothing has been written for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the created date is more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check that a new container is up and running with docker ps, and then verify that everything looks OK in the log.
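
Whether the log "seems stuck" can also be checked from the file's modification time instead of reading it by eye. This is a minimal sketch: log_is_stale is a made-up name, and it relies on the -mmin test of find, which is available in GNU and BSD find but is not part of POSIX.

```shell
# Succeed if the log file was last modified more than max_minutes ago.
# A missing file counts as "not stale" here, so check that it exists first.
log_is_stale() {
  local log_file="${1:-/tmp/sitespeed.io.log}" max_minutes="${2:-60}"
  [ -n "$(find "$log_file" -mmin +"$max_minutes" 2>/dev/null)" ]
}
```

For example, log_is_stale && docker ps lists the containers only when nothing has been written to the log for more than an hour.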