You are browsing a read-only backup copy of Wikitech. The primary site can be found at wikitech.wikimedia.org

Performance/Runbook/SyntheticToolAlert: Difference between revisions

imported>Phedenskog
(Added info about ttfb)
 
imported>Phedenskog
(Update ssh to cloud services)
 
Latest revision as of 13:50, 9 November 2021

This is the runbook for synthetic tool alerts, which fire when one of the tools is down or when we are missing data from that tool.

Meta

WebPageTest missing data

The WebPageTest alert fires if we are missing data from WebPageTest.

  1. Check if the WebPageTest job runner is running.
    1. Log in to the instance: ssh wpt-runner.webperf.eqiad1.wikimedia.cloud
    2. Check if the job runner is stuck. Run docker ps and check how long the container has been running. If it's more than an hour, it's stuck. Kill the container with docker kill <container name> and it will recover.
    3. Check the log file /tmp/sitespeed.io.log. Do you see any errors? Are the tests running? Look for entries like [2020-10-07 17:17:38] INFO: Sending url https://en.wikipedia.org/speed-tests/Banksy.enwiki.872156204/ to test on wpt.wmftest.org. If you see that, you know the runner is working.
    4. If you see errors like [2020-10-02 00:00:16] ERROR: Could not run test for WebPageTest {"name":"WPTAPIError","code":500,"message":"Internal Server Error"}, you know that something is wrong on the WebPageTest server or agent.
  2. Check if the WebPageTest server is working.
    1. Access the test log page in your browser: http://wpt.wmftest.org/testlog.php?days=1&filter=&all=on
    2. Do you see any tests? Click on them and check that they look ok
    3. If you cannot access the tests, log in to the WebPageTest server: ssh -i webpagetest.pem ubuntu@wpt.wmftest.org
    4. Dig into the different log files and check what you can see: Performance/WebPageTest#Logs
  3. Check if the WebPageTest agent is running: ssh -i WebPageTestAgent.pem ubuntu@3.95.23.228
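
The log check in step 1 can be sketched as a small shell helper. This is only an illustration, not an existing tool: the function name summarize_runner_log is hypothetical, and it assumes the log entries look like the two examples quoted above.

```shell
# Hypothetical helper (not part of the runbook tooling): summarize a sitespeed.io log.
# Counts the "Sending url" INFO entries and the ERROR entries quoted in step 1.
summarize_runner_log() {
  log_file="${1:-/tmp/sitespeed.io.log}"
  if [ ! -r "$log_file" ]; then
    echo "cannot read $log_file" >&2
    return 1
  fi
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  sent=$(grep -c 'INFO: Sending url' "$log_file" || true)
  errors=$(grep -c 'ERROR:' "$log_file" || true)
  echo "sent=$sent errors=$errors"
}
```

Running summarize_runner_log on the job runner instance prints one line with both counts. A non-zero sent count suggests the runner is submitting tests, while a growing errors count points at the WebPageTest server or agent.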

WebPageReplay missing data

  1. Check if the WebPageReplay server is running by logging in to the machine.
  2. Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  3. If the log file seems stuck (nothing has been written to it for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the container was created more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check with docker ps that a new container is up and running, and then verify that everything looks OK in the log.
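
The one-hour rule in step 3 could be checked with a sketch like the following. The function names and the MAX_AGE_SECONDS variable are hypothetical, and list_container_states assumes GNU date (for date -d) is available on the server.

```shell
# Age threshold in seconds (one hour, as in step 3). Hypothetical variable name.
MAX_AGE_SECONDS=3600

# Prints "stuck" if the container was created more than max_age seconds
# before now_epoch, otherwise "ok". Both timestamps are Unix epoch seconds.
container_state() {
  created_epoch=$1
  now_epoch=$2
  max_age=${3:-$MAX_AGE_SECONDS}
  if [ $((now_epoch - created_epoch)) -gt "$max_age" ]; then
    echo "stuck"
  else
    echo "ok"
  fi
}

# Prints "<container name> <state>" for every running container.
# Assumes GNU date, which can parse the timestamp that docker inspect reports.
list_container_states() {
  now=$(date +%s)
  docker ps --format '{{.Names}}' | while read -r name; do
    created=$(docker inspect --format '{{.Created}}' "$name")
    echo "$name $(container_state "$(date -d "$created" +%s)" "$now")"
  done
}
```

Calling list_container_states on the server prints one line per running container. A container reported as stuck is a candidate for docker kill <container name>.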

WebPageReplay CPU benchmark alert

The CPU benchmark measures how stable the metrics/CPU are on the machine that runs the tests. If they are unstable for some time, you need to deploy the tests on a new server.

WebPageReplay TTFB alert

The TTFB (time to first byte) should be really stable when you use WebPageReplay, since the tests run on the same server as the browser. If you see high variation in TTFB, something is really wrong on the server and you should try to deploy the tests on another server.

CRUX missing data

The CrUX (Chrome User Experience Report) data is collected once a day by a job that runs from the crontab.

  1. Log in to the server that collects the data from the Chrome User Experience Report API: ssh gpsi.webperf.eqiad1.wikimedia.cloud
  2. Check the log located at /tmp/sitespeed.io.log for errors.
  3. If you find out what's wrong and fix it, you can manually run the four tests that collect the data (look in the crontab to see how to do that). It doesn't matter when you run the tests, as long as you run them once per day (otherwise we will miss out on data).
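
Since the data has to be collected once per day, one way to see whether the job has already run today is to count the log entries for a given date. This is a sketch only: the function name entries_for_date is hypothetical, and it assumes the log on this server uses the same [YYYY-MM-DD HH:MM:SS] timestamp prefix as the WebPageTest examples above.

```shell
# Hypothetical helper: count the log entries written on a given day.
# Assumes each entry starts with a "[YYYY-MM-DD HH:MM:SS]" timestamp.
entries_for_date() {
  log_file=$1
  day=$2   # for example 2020-10-07
  # grep -c prints 0 and exits non-zero when nothing matches, hence "|| true".
  grep -c "^\[$day " "$log_file" || true
}
```

For example, entries_for_date /tmp/sitespeed.io.log "$(date +%Y-%m-%d)" prints the number of entries for today. A count of 0 suggests the daily job has not run yet. The commands to run manually can be listed with crontab -l.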

sitespeed.io missing data

  • Check if the sitespeed.io server is running by logging in to the machine.
  • Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
  • If the log file seems stuck (nothing has been written to it for an hour or so), check whether the container is stuck by listing the container status with docker ps. If the container was created more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container with docker kill <container name>, wait a while, check with docker ps that a new container is up and running, and then verify that everything looks OK in the log.
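
Whether the log is stuck can be judged from its last modification time rather than by reading it. The following is a minimal sketch: the function name log_freshness is hypothetical, and it relies on the -mmin option of find, which GNU and BSD find support but POSIX does not require.

```shell
# Hypothetical helper: report whether the log was written to recently.
# Prints "missing", "stale" (not modified for more than max_minutes) or "fresh".
log_freshness() {
  log_file="${1:-/tmp/sitespeed.io.log}"
  max_minutes="${2:-60}"
  if [ ! -e "$log_file" ]; then
    echo "missing"
  elif [ -n "$(find "$log_file" -mmin +"$max_minutes")" ]; then
    # find prints the path only if it was last modified more than max_minutes ago.
    echo "stale"
  else
    echo "fresh"
  fi
}
```

Running log_freshness with no arguments checks /tmp/sitespeed.io.log against the one-hour limit. A result of stale is the cue to look at docker ps as described in the last bullet.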