Load-balancing pooling and depooling
Run scap pull on an appserver before pooling.
# Pooling a server
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=yes
# Depooling a server
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=no
# Depool a server and remove it from software distribution
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=inactive
# Set a server's weight
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/weight=$weight
# All-in-one
sudo confctl select dc=$datacenter,cluster=$appserver_cluster,name=$host set/pooled=yes:weight=$weight
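The selector syntax above can be wrapped in a small helper so that a typo in the pooled state is caught before anything is sent to conftool. This is only a sketch: the function names confctl_selector and confctl_pool_command are hypothetical, and the helper prints the command instead of running it.

```shell
# Build a confctl selector from datacenter, cluster and host name.
confctl_selector() {
    printf 'dc=%s,cluster=%s,name=%s\n' "$1" "$2" "$3"
}

# Print (do not run) the command that sets a host's pooled state.
# Only the three states listed above are accepted: yes, no, inactive.
confctl_pool_command() {
    case "$4" in
        yes|no|inactive) ;;
        *) echo "unknown pooled state: $4" >&2; return 1 ;;
    esac
    printf 'sudo confctl select %s set/pooled=%s\n' \
        "$(confctl_selector "$1" "$2" "$3")" "$4"
}
```

For example, confctl_pool_command eqiad jobrunner mw1111.eqiad.wmnet no prints the depool command for that host, which can then be reviewed and pasted on a host where confctl is available.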
From the host itself
# Pooling a server
scap pull; sudo -i pool
# Depooling a server
sudo -i depool
- Appservers RED
- Envoy Telemetry
- Mediawiki Error logs
- API call logs
- Apache2 AccessLogs (MW-on-K8s)
- php-fpm slowlog dashboard (MW-on-K8s)
- Choose one of the mediawiki debug servers. Then, on that server:
- Disable puppet:
sudo disable-puppet 'insert reason'
- Apply change locally under
sudo apache2ctl restart
- Test your change by making relevant HTTP request. See Debugging in production for how.
- When you're done,
sudo enable-puppet 'insert reason'
Consider scheduling any configuration updates on the Deployments page. A bad configuration going live can easily result in a site outage.
- Test your change in deployment-prep and make sure that it works as expected.
- In the operations/puppet repository, make your change in the
- In the same commit, add one or more httpbb tests in the
modules/profile/files/httpbb directory, asserting that your change works as you intend. (Consider automating the same checks you just performed by hand.)
- For example, if you are adding or modifying a RewriteRule, please add tests covering some URLs that are expected to change.
- On deploy1001, run all httpbb tests on an affected host. Neither your changes nor your new tests are in effect yet, so any test failures are unrelated. All tests are expected to pass -- if they don't, you should track down and fix the problem before continuing.
rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/appserver/* --host mwdebug1001.eqiad.wmnet
Sending to mwdebug1001.eqiad.wmnet...
PASS: 99 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
- Submit your change and tests to gerrit as a single commit.
- Disable puppet across the affected mediawiki application servers.
- Cumin can help find the precise set of hosts. For example, this is a recent query:
cumin 'R:File = "/etc/apache2/sites-available/04-remnant.conf"' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/#/c/380774/"' -b 10
In this case the change was related to a RewriteRule change in 04-remnant.conf. The query must be adapted each time to the file(s) modified by the Gerrit change.
- Merge via gerrit and run on puppetmaster1001 the usual
- Go to one of the mwdebug servers and enable and run puppet. Apache will reload its configuration automatically; check that no error messages are emitted. Running apachectl -t after the puppet run helps verify that the new configuration is syntactically correct (this does not guarantee that it works as intended).
- Some Apache directive changes need a full restart to get applied, not a simple reload. These changes are very rare and they are clearly indicated in Apache's documentation, so please verify it beforehand. Simple RewriteRule changes require only an Apache reload.
- On deploy1001, re-run all httpbb tests on an affected host. Your new tests verify that your intended change is functioning correctly, and re-running the old tests verifies that existing behavior wasn't inadvertently changed in the process. All tests are expected to pass -- if they don't, you should revert your change.
rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
Sending to mwdebug1001.eqiad.wmnet...
PASS: 101 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
- Enable/Run puppet on another mediawiki application server that is taking traffic, de-pooling it beforehand via confctl. Verify again from deploy1001 that everything is working as expected, running httpbb.
- Repool the host mentioned above and verify on Apache access logs that everything looks fine. If you want to be extra paranoid, you can check the host level metrics via https://grafana.wikimedia.org/d/000000327/apache-fcgi?orgId=1 and make sure that nothing is out of the ordinary.
- Re-enable puppet across the appservers previously disabled via cumin.
- Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.
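The httpbb runs in the procedure above are judged by eye from the summary line. A minimal sketch of turning that into a scripted gate follows, assuming the summary format shown in the sample runs ("All assertions passed."). The function names are hypothetical, and httpbb's own exit status may already be sufficient, which this sketch does not rely on.

```shell
# Succeed only if httpbb output read from stdin reports a full pass.
# Assumes the summary line format shown in the sample runs above.
httpbb_all_passed() {
    grep -q 'All assertions passed\.'
}

# Run the appserver test suite against one host and gate on the summary.
# $1: target host, e.g. mwdebug1001.eqiad.wmnet
httpbb_gate() {
    httpbb /srv/deployment/httpbb-tests/appserver/* --host "$1" | httpbb_all_passed
}
```

On the deployment host this could be used as httpbb_gate mwdebug1001.eqiad.wmnet || echo "tests failed, do not continue".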
You can find apache's request log at
Mcrouter never breaks (TM)
See Debugging in production#Debugging databases.
Envoy is used for:
- TLS termination: envoy listens on 443 and proxy passes the request to apache listening on 80
- Services proxy: for proxying calls from MediaWiki to external services
It is a resilient service and should not usually fail. Some quick pointers:
- Logs are under /var/log/envoy.
- /var/log/envoy/syslog.log (or sudo journalctl -u envoyproxy.service) to see the daemon logs
- Verify that configuration is valid: sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate.
- Envoy uses a hot restarter that allows seamless restarts without losing a request. Use systemctl reload envoyproxy.service unless you really know why that wouldn't work.
- You can check the status of envoy and much other info under http://localhost:9631. Of specific utility is /stats which returns current stats. Refer to the admin interface docs for details.
If you see an error about runtime variables being set, reloading envoy should solve the alert in a few minutes.
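The validate and reload pointers above combine naturally into a "validate first, reload only if valid" wrapper. The sketch below uses the two commands quoted in this section. The function names are hypothetical, and prefixing the reload with sudo is an assumption.

```shell
# Validate the Envoy configuration (command taken from the pointers above).
envoy_validate() {
    sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate
}

# Hot-restart Envoy through systemd (sudo is an assumption here).
envoy_reload() {
    sudo systemctl reload envoyproxy.service
}

# Reload only when validation succeeds; otherwise leave the running
# daemon untouched and report the failure on stderr.
envoy_safe_reload() {
    if envoy_validate; then
        envoy_reload
    else
        echo "envoy config invalid, not reloading" >&2
        return 1
    fi
}
```

Calling envoy_safe_reload on a host leaves the running configuration in place whenever validation fails.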
PHP 7 is the interpreter we use for serving mediawiki. This page collects resources about how to troubleshoot and fix some potential issues with it. For more general information about how we serve mediawiki in production, see the page about Application servers.
The php-fpm daemon logs are sent via Rsyslog to Kafka/Logstash under
type:syslog program:php7.4-fpm. They are also stored locally under
The php-fpm slow requests log can be found at
The MediaWiki application logs are sent directly to Rsyslog at localhost:10514 (per wmf-config/logging.php) and end up in Logstash under
type:mediawiki. These can be tailed locally via
sudo tcpdump -i any -l -A -s0 port 10514.
Any other raw syslog() calls in PHP, such as from php-wmerrors, also end up in Logstash under
type:mediawiki. These can be tailed on their way out (to Kafka or Udp2log) via
sudo tcpdump -i any -l -A -s0 port 8420. This will include the MediaWiki application logs as well.
- General Apache/HHVM dashboard: does not yet have much php7-specific data, but will include it at a later date.
- PHP7 transition A dashboard with most of the salient data about the ongoing transition to php7
- MediaWiki Appservers A general dashboard for appservers including a lot of useful metrics.
Debugging procedures and tools
php7adm is a tool that lets you interact with the local php-fpm daemon to gather information on the running status of the application. Its usage is pretty simple:
$ php7adm [OPTION]
To see a list of available actions, just run the command without arguments:
$ php7adm
Supported urls:
/metrics Metrics about APCu and OPcache usage
/apcu-info Show basic APCu stats
/apcu-meta Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
/apcu-free Clear all data from APCu
/opcache-info Show basic opcache stats
/opcache-meta Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
/opcache-free Clear all data from opcache
All data, apart from the /metrics endpoint, is reported in JSON format.
Php-fpm is a prefork-style appserver, which means that every child process serves just one request at a time. So attaching with
strace to an individual process should give you a lot of information on what is going on there. We don't yet have an automated dumper of stack traces from php-fpm, but you can use
quickstack for a quick peek at the stack traces, or gdb for more details.
Response to common alerts
Average latency exceeded
This alert means something is currently very wrong, and MediaWiki is responding to clients at an unusually slow pace. This can be due to a number of reasons, but slow responses
from all servers typically mean that some backend system is responding slowly. Typical troubleshooting goes as follows:
- Check the application server RED dashboard in the panels "mcrouter" and "databases" to quickly see if anything stands out
- Check SAL for any deployments corresponding to the time of the alert or a few minutes earlier. If there are any, request a rollback while you keep debugging. In the worst case the changes will have to be deployed again, but in many cases the rollback resolves the outage.
- ssh to one server in the cluster that is experiencing the issue. Check the last entries in the php-fpm slowlog (located at
/var/log/php7.4*-slowlog.log). If all the requests you see popping up are blocked in a specific function, that should give you a pointer to what isn't working: caches, databases, or backend services.
- For databases go check the slow query dashboard on logstash
- For caches, you can go check the memcached dashboards on grafana.
- For curl requests, you can check the envoy telemetry dashboard - set the origin cluster to the cluster where you're seeing latency (excluding
local_port_XX, which is pointing to the local appserver)
- If none of the above works, escalate the problem to the wider team
This alert comes from trying to render a page on enwiki using php7 (not HHVM). Since the request goes through apache httpd, first check if apache is alerting as well, then look at opcache alerts. If there is a critical alert on opcache too, look at the corresponding section below. If only the php7 rendering is alerting, check the following:
- What does the php-fpm log say? Any specific errors repeating right now?
$ tail -f /var/log/php7.4-fpm/error.log | fgrep -v '[NOTICE]'
Jun 0 00:00:00 server php7.4-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging ...
For example, if you see a lot of slow requests from a specific PID, it might be interesting to strace it. If some strange and unique error message is present, the opcache is probably corrupted. In that case, confirm by resetting opcache and verifying that the problem subsides. Check whether we have an open ticket about opcache corruption and record the occurrence there.
- What can you see looking at non-200 responses that come from php-fpm in the apache log? Is there any trend? Does anything stand out?
# This will just show 5xx errors, nothing else.
$ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
- If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure
$ sudo -i depool
IMPORTANT: Remember to repool the server afterwards.
If this happens once, and on just one server, simply restart php-fpm:
$ sudo -i /usr/local/sbin/restart-php7.4-fpm
and watch the logs/icinga to see if the issue has subsided. If the issue affects more than one server, escalate to the SRE team responsible for the service.
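To compare servers, or to compare the same server before and after a php-fpm restart, it can help to reduce the 5xx check on the Apache log to a single number. The sketch below counts lines containing the same marker string used in the fgrep example earlier in this section. The function name count_fcgi_5xx is hypothetical.

```shell
# Count php-fpm 5xx responses in Apache access-log lines read from stdin.
# The marker string is the one used in the fgrep example in this section.
# grep -c exits non-zero when the count is 0, hence the "|| true".
count_fcgi_5xx() {
    grep -cF 'fcgi://localhost/5' || true
}
```

A possible invocation is tail -n100000 /var/log/apache2/other_vhosts_access.log | count_fcgi_5xx (note there is no -f here, so the pipeline terminates).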
We currently use the same hosts as both jobrunners and videoscalers, and sometimes their performance is impacted by an overwhelming number of video encodes. Quick diagnostic: many (100+) ffmpeg processes running on a jobrunner server, icinga checks timing out, etc.
You should also look at the overall health of the jobrunner server group (not split by jobrunner vs videoscaler).
Jobs are much more important to run on time than video scaling, so the current solution is to dedicate some servers to jobrunning and others to just videoscaling:
- Verify current config in eqiad:
confctl select 'dc=eqiad,cluster=videoscaler' get
confctl select 'dc=eqiad,cluster=jobrunner' get
When dividing the clusters, it's recommended to put the videoscalers on better hardware since those tasks are more CPU intensive.
- Sample command for making a mw server dedicated to jobrunning:
confctl select 'cluster=videoscaler,name=mw1111.eqiad.wmnet' set/pooled=no
confctl select 'cluster=jobrunner,name=mw1111.eqiad.wmnet' set/pooled=yes
You should also log into the host and kill any remaining ffmpeg processes (
sudo pkill ffmpeg). The job queue should automatically retry them later.
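The "many (100+) ffmpeg processes" rule of thumb above can be expressed as a small check. This is a sketch only: the function name ffmpeg_overloaded is hypothetical, and the default threshold of 100 is taken from the quick diagnostic above rather than from any tuned value.

```shell
# Succeed when the given ffmpeg process count suggests overload.
# $1: number of ffmpeg processes, $2: optional threshold (default 100,
# following the "many (100+)" rule of thumb above).
ffmpeg_overloaded() {
    [ "$1" -ge "${2:-100}" ]
}
```

On a jobrunner host the count can be supplied by pgrep, for example ffmpeg_overloaded "$(pgrep -c ffmpeg)" && echo "consider dedicating this host to jobrunning".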
scap proxy and canary
A scap proxy is an intermediate rsync proxy between the deployment host and the rest of the production infrastructure.
You can find a list of them in hieradata/common/scap/dsh.yaml
A canary is one of the first hosts to have new code deployed via scap. It is checked by scap for its error rate, and scap auto-aborts the deployment if it is too high.
To list them:
confctl select service=canary get
Adding a new server into production
- Create DNS patch to assign IP addresses to them. This is usually done by dcops nowadays but they might want your review for it. (example change)
- Create a puppet patch that adds the servers with the right regexes in site.pp. Apply the spare::system puppet role. (example change)
- Decide which role this server should have (appserver, API appserver, jobrunner,..). Use Netbox to search for the host and see which rack it is in. Try to balance server roles across both racks and rows.
- Create a puppet patch that adds the proper role to the servers and adds them in conftool-data in the right section. Don't merge it yet. (example change)
- Schedule Icinga downtimes for your new hosts for 1h. ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet
- Merge the patch to add puppet roles to the new servers.
- Force a puppet run via cumin. Some errors are normal in the first puppet run. ex: dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q'
- Force a second puppet run via cumin. It should complete successfully.
- Run downtime with force-puppet-run via cumin ex:
sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet
- Run a restart of all apache2 processes ex:
sudo cumin mw24[20-51].codfw.wmnet 'systemctl restart apache2'
- Watch all (new) Icinga alerts on the hosts turn green, make sure Apache does not have to be restarted. You can "reschedule next service check" in the Icinga web UI to speed things up. It is expected that the "not in dsh group" alert stays CRIT until the server is pooled below. Once all alerts besides that one are green (not PENDING and not CRIT) it is ok to go ahead.
- Check for ongoing deployments. Wait if that is the case. You can use "
jouncebot: now" on IRC, or check the Deployments page.
- Run "scap pull" on the new servers to ensure the latest deployed MediaWiki version is present.
- Give the server a weight with confctl: ex:
[cumin1001:~] $ sudo -i confctl select name=mw1355.eqiad.wmnet set/weight=30
- Pool the server with confctl: ex:
[cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet set/pooled=yes
- Watch Grafana Host Overview, select server and see it is getting traffic.
Spreading application servers out across rows and racks
We aim to spread out application server roles (regular appserver, API appserver, etc) across both rows (ex. B) as well as racks (ex. B3) in each of the main data centers (currently eqiad and codfw). When an entire rack or entire row fails, this distribution of hosts minimizes the impact to any single role.
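To see how evenly roles are currently spread, a tally per role and row is often enough. The sketch below assumes a hypothetical input format of one "host role rack" triple per line (for instance exported by hand from Netbox) and takes the row to be the first character of the rack name, so rack B3 is in row B. The function name tally_roles_by_row is hypothetical as well.

```shell
# Tally hosts per role and row from "host role rack" lines on stdin.
# The input format is hypothetical; the row is taken to be the first
# character of the rack name (e.g. rack B3 is in row B).
tally_roles_by_row() {
    awk 'NF >= 3 { n[$2 " " substr($3, 1, 1)]++ }
         END { for (k in n) print k, n[k] }' | sort
}
```

Each output line has the form "role row count". A role whose hosts all share one row would be fully affected by a failure of that row.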
Removing old appservers from production (decom)
- Identify servers you want to decom in netbox. The procurement ticket linked from there tells you the purchase date to see how old they are.
- Create a Gerrit patch that removes the servers from site.pp and conftool-data (example change), but don't merge it yet.
- Set the servers to 'pooled=no' and watch in Grafana how they stop serving traffic, temperature goes down etc. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw123[2-5].eqiad.wmnet' set/pooled=no
- If needed, make and deploy any mediawiki-config changes
- Use the downtime cookbook to schedule monitoring downtimes for the servers. Give a reason and link to your decom ticket. ex: [cumin1001:~] $ sudo cookbook sre.hosts.downtime -r decom -t T247780 -H 2 mw125[0-3].eqiad.wmnet
- If everything seems fine, set the servers to 'pooled=inactive' now. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw125[0-3].eqiad.wmnet' set/pooled=inactive
- If you are sure, run the actual decom cookbook now. This step is destructive so you will have to reinstall servers to revert. ex: [cumin1001:~] $ sudo cookbook sre.hosts.decommission mw125[0-3].eqiad.wmnet -t T247780
- Merge your prepared puppet change to remove them from site.pp and conftool-data.
- optional: Run puppet on Icinga and see the servers and services on them disappear from monitoring.
- optional: Confirm in Netbox the state of the servers is "decommissioning" now.
- Check for any other occurrences of the hostnames in the puppet repo.
- Check if any of the servers was a scap proxy (hieradata/common/scap/dsh.yaml). Remove if needed. (example change)
- Hand over the decom ticket to dcops for physical unracking and the final steps in the server lifecycle.
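The two "check for occurrences" steps above (leftover hostnames in the puppet repo, and scap proxy membership) can be sketched with plain grep, run from a local checkout of operations/puppet. The function names are hypothetical. Matching is done on fixed strings, so a short prefix such as mw125 would also match mw1250; pass full hostnames.

```shell
# List files under a checkout that still mention a hostname.
# $1: hostname (matched as a fixed string), $2: path to the repo checkout.
# Prints nothing, and still succeeds, when there are no matches.
find_host_references() {
    grep -rlF -- "$1" "$2" || true
}

# Succeed if the host appears in the scap proxy list.
# $2 defaults to the path named above, relative to the puppet repo root.
is_scap_proxy() {
    grep -qF -- "$1" "${2:-hieradata/common/scap/dsh.yaml}"
}
```

For example, find_host_references mw1250.eqiad.wmnet . lists files that still need cleaning up, and is_scap_proxy mw1250.eqiad.wmnet tells you whether the scap proxy list needs a patch too.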