Performance/Runbook/SyntheticToolAlert: Difference between revisions
imported>Phedenskog (Added info about ttfb) |
imported>Phedenskog (Update ssh to cloud services) |
Latest revision as of 13:50, 9 November 2021
This is the runbook for the synthetic tool alerts, which fire when one of the tools is down or when we are missing data from that tool.
Meta
- Issue tracker (Phabricator): WebPageTest WebPageReplay
- Documentation: WebPageTest WebPageReplay
WebPageTest missing data
The WebPageTest alert fires if we are missing data from WebPageTest.
- Check if the WebPageTest job runner is running.
- Log in to the instance:
ssh wpt-runner.webperf.eqiad1.wikimedia.cloud
- Check if the job runner is stuck. Run
docker ps
and check how long the container has been running. If it has been running for more than an hour, it is stuck. Kill the container and it will recover:
docker kill <container name>
- Check the log file /tmp/sitespeed.io.log. Do you see any errors? Are the tests running? Look for entries like
[2020-10-07 17:17:38] INFO: Sending url https://en.wikipedia.org/speed-tests/Banksy.enwiki.872156204/ to test on wpt.wmftest.org.
If you see that, you know the runner is working.
- If you see errors like
[2020-10-02 00:00:16] ERROR: Could not run test for WebPageTest {"name":"WPTAPIError","code":500,"message":"Internal Server Error"}
then you know something is wrong on the WebPageTest server/agent.
- Check if the WebPageTest server is working.
- Access the test log page in your browser: http://wpt.wmftest.org/testlog.php?days=1&filter=&all=on
- Do you see any tests? Click on them and check that they look ok
- If you cannot access the tests, log into the WebPageTest server.
ssh -i webpagetest.pem ubuntu@wpt.wmftest.org
- Dig into the different log files and check what you can see: Performance/WebPageTest#Logs
- Check if the WebPageTest agent is running:
ssh -i WebPageTestAgent.pem ubuntu@3.95.23.228
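As a rough aid for the log check on the job runner, the sketch below counts error lines and prints the most recent "Sending url" entry. The function name summarize_runner_log is hypothetical and not part of sitespeed.io or WebPageTest. The default log path and the two patterns are taken from the examples in this runbook, and the real log may contain other formats.

```shell
# Hypothetical helper (not part of sitespeed.io): summarize a runner log.
# Prints the number of lines containing "ERROR:" and the last
# "INFO: Sending url" line, using the patterns shown in this runbook.
summarize_runner_log() {
  log="${1:-/tmp/sitespeed.io.log}"
  # grep -c exits non-zero when there are no matches, so tolerate that.
  errors=$(grep -c 'ERROR:' "$log" || true)
  last_sent=$(grep 'INFO: Sending url' "$log" | tail -n 1)
  echo "errors=$errors"
  echo "last_sent=${last_sent:-none}"
}
```

Called without arguments on the job runner it reads /tmp/sitespeed.io.log. An old timestamp in last_sent, or a growing error count, is a hint that the runner or the WebPageTest server needs a closer look.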
WebPageReplay missing data
- Check if the WebPageReplay server is running by logging in to the machine.
- Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
- If the log file seems stuck (nothing has been written to it for an hour or so), check if the container is stuck by listing the container status:
docker ps
If the created date is more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container:
docker kill <container name>
Then wait some time, check that a new container is up and running with
docker ps
and verify that everything looks ok in the log.
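The "created more than one hour ago" check can be sketched as a small filter over the docker ps output. The helper name filter_old_containers is hypothetical. It assumes the human-readable ages that docker ps prints (such as "10 minutes ago" or "About an hour ago") and treats anything reported in hours or longer as a candidate for docker kill.

```shell
# Hypothetical helper: reads "name<TAB>age" lines on stdin and prints the
# names of containers whose age is reported in hours or longer units.
# Assumes the human-readable durations printed by `docker ps`
# (for example "10 minutes ago", "About an hour ago", "2 hours ago").
filter_old_containers() {
  awk -F '\t' '$2 ~ /hour|day|week|month|year/ { print $1 }'
}

# Usage on the test server (shown as a comment, not executed here):
#   docker ps --format '{{.Names}}\t{{.RunningFor}}' | filter_old_containers
```

Matching on the age column only (the second tab-separated field) avoids false positives from container names that happen to contain words like "hour".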
WebPageReplay CPU benchmark alert
The CPU benchmark measures how stable the metrics/CPU are on the machine that runs the tests. If the benchmark is unstable for some time, you need to deploy the tests on a new server.
WebPageReplay TTFB alert
The TTFB (time to first byte) should be really stable when you use WebPageReplay, since the tests run on the same server as the browser. If you see high variation in TTFB, something is really wrong on the server and you should try to deploy the tests on another server.
CRUX missing data
The CrUX (Chrome User Experience Report) data is collected once a day by a job that runs from the crontab.
- Log in to the server that collects the data from the Chrome User Experience Report API:
ssh gpsi.webperf.eqiad1.wikimedia.cloud
- Check the log located in /tmp/sitespeed.io.log to see if there are any errors.
- If you find out what is wrong and fix it, you can manually run the four tests that collect the data (look in the crontab to see how to do that). It doesn't matter when you run the tests, as long as you run them once per day (otherwise we will miss out on data).
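To find the commands for a manual run, it can help to list only the active crontab entries. The sketch below is a generic filter that drops comments and blank lines; the helper name active_cron_entries is hypothetical, and which of the remaining entries are the four data collection tests has to be judged from the crontab itself.

```shell
# Hypothetical helper: reads crontab text on stdin and prints only the
# active entries, dropping comment lines and blank lines.
active_cron_entries() {
  grep -v -E '^[[:space:]]*(#|$)'
}

# Usage on the server (shown as a comment, not executed here):
#   crontab -l | active_cron_entries
```

Each printed line starts with the five schedule fields, followed by the command that can be copied and run by hand.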
sitespeed.io missing data
- Check if the sitespeed.io server is running by logging in to the machine.
- Check the log file located at /tmp/sitespeed.io.log. Can you see any errors in the log?
- If the log file seems stuck (nothing has been written to it for an hour or so), check if the container is stuck by listing the container status:
docker ps
If the created date is more than one hour ago, something is wrong, because we start a new container roughly every 10 minutes. Kill the container:
docker kill <container name>
Then wait some time, check that a new container is up and running with
docker ps
and verify that everything looks ok in the log.
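Whether the log "seems stuck" can be approximated by checking when the file was last modified. This is a minimal sketch, assuming that every new log entry updates the modification time of /tmp/sitespeed.io.log; the helper name log_is_stale and the 60 minute default are illustrative choices based on the "one hour or so" rule of thumb in this runbook.

```shell
# Hypothetical helper: succeeds (exit status 0) when the log file was last
# modified more than the given number of minutes ago.
# Note: a missing file is reported as "not stale", so check that it exists.
log_is_stale() {
  log="${1:-/tmp/sitespeed.io.log}"
  max_minutes="${2:-60}"
  # find prints the path only if its mtime is older than max_minutes.
  [ -n "$(find "$log" -mmin "+$max_minutes" 2>/dev/null)" ]
}

# Usage on the server (shown as a comment, not executed here):
#   if log_is_stale; then echo "log looks stuck, check docker ps"; fi
```

A stale log is only a hint; the container status and the log contents still need to be checked as described in the steps.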