Application servers/Runbook
Apache
Testing config
- Choose one of the MediaWiki debug servers. Then, on that server:
- Disable puppet:
sudo puppet agent --disable 'insert reason'
- Apply your change locally under
/etc/apache2/sites-enabled/
- Restart Apache:
sudo apache2ctl restart
- Test your change by making the relevant HTTP requests. See Debugging in production for how.
- When you're done, re-enable puppet:
sudo puppet agent --enable
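For example, a quick manual check from the debug server itself might look like this (the URL and Host header are illustrative; pick a request that exercises the rule you changed):
curl -sI -H 'Host: en.wikipedia.org' 'http://localhost/wiki/Special:BlankPage'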
Deploying config
Consider scheduling any configuration update on the Deployments page before making it: a bad configuration going live can easily result in a site outage.
- Test your change in deployment-prep and make sure that it works as expected.
- In the operations/puppet repository, make your change in the
modules/mediawiki/files/apache/sites
directory.
- In the same commit, add one or more httpbb tests in the
modules/profile/files/httpbb
directory, asserting that your change works as you intend. (Consider automating the same checks you just performed by hand.) For example, if you are adding or modifying a RewriteRule, please add tests covering some URLs that are expected to change.
- On deploy1001, run all httpbb tests on an affected host. Neither your changes nor your new tests are in effect yet, so any test failures are unrelated. All tests are expected to pass -- if they don't, you should track down and fix the problem before continuing.
rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
Sending to mwdebug1001.eqiad.wmnet...
PASS: 99 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
- Submit your change and tests to gerrit as a single commit.
- Disable puppet across the affected mediawiki application servers.
- Cumin can help in finding the precise set of hosts. For example, this is a recent query; in this case the change touched a RewriteRule in 04-remnant.conf, but the file(s) in the query must match the file(s) modified by your Gerrit change.
cumin 'R:File = "/etc/apache2/sites-available/04-remnant.conf"' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/#/c/380774/"' -b 10
- Merge via Gerrit and, on puppetmaster1001, run the usual
puppet-merge
- Go to one of the mwdebug servers and enable/run puppet. Apache will reload its configuration automatically; check that no error messages are emitted. Running apachectl -t after the puppet run helps verify that the new configuration is syntactically correct (which of course doesn't guarantee it will work as intended).
- Some Apache directive changes need a full restart to be applied, not just a reload. These changes are very rare and are clearly indicated in Apache's documentation, so please verify beforehand. Simple RewriteRule changes only require an Apache reload.
- On deploy1001, re-run all httpbb tests on an affected host. Your new tests verify that your intended change is functioning correctly, and re-running the old tests verifies that existing behavior wasn't inadvertently changed in the process. All tests are expected to pass -- if they don't, you should revert your change.
rzl@deploy1001:~$ httpbb /srv/deployment/httpbb-tests/* --host mwdebug1001.eqiad.wmnet
Sending to mwdebug1001.eqiad.wmnet...
PASS: 101 requests sent to mwdebug1001.eqiad.wmnet. All assertions passed.
- Enable/run puppet on another mediawiki application server that is taking traffic, depooling it beforehand via confctl (see the example sequence after this list). Verify again from deploy1001 that everything is working as expected by running httpbb.
- Repool the host mentioned above and verify in the Apache access logs that everything looks fine. If you want to be extra paranoid, you can check the host-level metrics via https://grafana.wikimedia.org/d/000000327/apache-hhvm?orgId=1 and make sure that nothing is out of the ordinary.
- Re-enable puppet across the appservers previously disabled via cumin.
- Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.
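A minimal sketch of the depool/verify/repool sequence on a single canary appserver (the hostname is illustrative, and the pool/depool helpers are assumed to be the standard conftool wrapper scripts present on appservers):
# On the canary appserver (hostname illustrative):
sudo -i depool
sudo puppet agent --enable
sudo run-puppet-agent
sudo apachectl -t
# From deploy1001, re-run the tests against that host:
httpbb /srv/deployment/httpbb-tests/* --host mw1261.eqiad.wmnet
# Back on the appserver, once everything passes:
sudo -i pool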
Nginx
TODO.
Mcrouter
TODO.
Envoy
Envoy is used for proxying calls from MediaWiki to external services. It's a resilient service and should not usually fail. Some quick pointers:
- Logs are under /var/log/envoy.
- /var/log/envoy/syslog.log (or sudo journalctl -u envoyproxy.service) to see the daemon logs
- Verify that the configuration is valid: sudo -u envoy /usr/bin/envoy -c /etc/envoy/envoy.yaml --mode validate.
- Envoy uses a hot restarter that allows seamless restarts without losing a request. Use systemctl reload envoyproxy.service unless you really know why that wouldn't work.
- You can check the status of envoy and a lot of other info under http://localhost:9631. Of specific utility is /stats, which returns current stats. Refer to the admin interface docs for details.
If you see an error about runtime variables being set, reloading envoy should solve the alert in a few minutes.
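For a quick look from the host itself, the admin interface can be queried with curl (/server_info and /stats are standard Envoy admin endpoints; the stats filter shown is only an example):
curl -s http://localhost:9631/server_info
curl -s 'http://localhost:9631/stats?filter=downstream_rq_5xx'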
PHP 7
PHP 7 is the interpreter we use for serving MediaWiki. This page collects resources about how to troubleshoot and fix some potential issues with it. For more general information about how we serve MediaWiki in production, see the page about Application servers.
Logging
- Local daemon logs are in /var/log/php7.2-fpm/
- MediaWiki Errors on kibana
Dashboards
- General Apache/HHVM dashboard. It still doesn't have much php7-specific data, but it will contain it at a later date.
- PHP7 transition. A dashboard with most of the salient data about the ongoing transition to php7.
- MediaWiki Appservers. A general dashboard for appservers, including a lot of useful metrics.
Debugging procedures and tools
php7adm
php7adm is a tool that lets you interact with the local php-fpm daemon to gather information on the running status of the application. Its usage is pretty simple:
$ php7adm [OPTION]
To see a list of available actions, just run the command without arguments:
$ php7adm
Supported urls:
/metrics Metrics about APCu and OPcache usage
/apcu-info Show basic APCu stats
/apcu-meta Dump meta information for all objects in APCu to /tmp/apcu_dump_meta
/apcu-free Clear all data from APCu
/opcache-info Show basic opcache stats
/opcache-meta Dump meta information for all objects in opcache to /tmp/opcache_dump_meta
/opcache-free Clear all data from opcache
All data, apart from the /metrics endpoint, is reported in JSON format.
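For example (the grep filter is illustrative; /metrics is plain text, while the other endpoints return JSON and pair well with jq):
$ php7adm /metrics | grep -i opcache
$ php7adm /apcu-info | jq .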
Low-level debugging
php-fpm is a prefork-style appserver, which means that every child process serves just one request at a time, so attaching strace to an individual process should give you a lot of information on what is going on there. We still don't have an automated dumper of stacktraces from php-fpm, but you can use quickstack as usual for a quick peek at the stacktraces, or gdb for more details.
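A minimal sketch of that workflow (the process name, PID and flags are illustrative, not a prescribed set):
# Find a busy php-fpm worker (process name may differ per PHP version):
$ ps -C php-fpm7.2 -o pid,pcpu,etime,args --sort=-pcpu | head
# Follow its syscalls:
$ sudo strace -p <PID> -tt -f -s 200
# Or take a quick stack snapshot:
$ sudo quickstack -p <PID>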
Response to common alerts
PHP7 opcache health
This alert can arise in three different scenarios:
- The opcache is full
- The opcache has too low a cache hit ratio
- The opcache has little free space
It's quite possible for multiple servers to raise the same alert at the same time. That's because deployments are what consume opcache, so opcache usage grows in lockstep on servers that were last restarted before the same deployments. If you want to know more about what's going on, you can fetch the info yourself:
$ php7adm /opcache-info | jq .
{
"opcache_enabled": true,
"cache_full": false,
"restart_pending": false,
"restart_in_progress": false,
"memory_usage": {
...
While we should eventually have a cron job checking for these conditions and doing this work for us, for now you can fix the issue by safely restarting php-fpm:
$ sudo -i /usr/local/sbin/restart-php7.2-fpm
Be careful if you're restarting multiple servers this way: restart-php7.2-fpm depools the server, restarts php-fpm, then repools the server. You should never restart more than 10% of the servers in a cluster at the same time.
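If several servers need the same treatment, a batched cumin run keeps within that limit (the host query, batch size and sleep are illustrative; adjust them to the cluster you are touching):
sudo cumin -b 2 -s 30 'mw12[50-55].eqiad.wmnet' '/usr/local/sbin/restart-php7.2-fpm'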
PHP7 rendering
This alert comes from trying to render a page on enwiki using php7 (not HHVM). Since the request goes through apache httpd, first check whether apache is alerting as well; in that case something else might be at play (for example, HHVM being stuck and using up all of httpd's connection slots). Then look at opcache alerts: if there is a critical alert on opcache too, see the corresponding section above. If only the php7 rendering is alerting, check the following:
- What does the php-fpm log say? Any specific errors repeating right now?
$ tail -f /var/log/php7.2-fpm/error.log | fgrep -v '[NOTICE]'
Jun 0 00:00:00 server php7.2-fpm[pid]: [WARNING] [pool www] child PID, script '/srv/mediawiki/docroot/wikipedia.org/w/index.php' (request: "GET /wiki/Special:Random") executing too slow (XX.XX sec), logging
...
For example, if you see a lot of slow requests from a specific PID, it might be interesting to strace it. If some strange and unique error message is present, the opcache is probably corrupted. In that case, confirm by resetting the opcache and verifying that the problem goes away. Search for an open ticket about opcache corruptions and register the occurrence there.
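Resetting the opcache can be done with the php7adm endpoints listed above; for example:
$ php7adm /opcache-free
$ php7adm /opcache-info | jq .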
- What can you see looking at non-200 responses coming from php-fpm in the apache log? Any trend? Does anything stand out?
# This will just show 5xx errors, nothing else.
$ tail -n100000 -f /var/log/apache2/other_vhosts_access.log | fgrep fcgi://localhost/5
- If nothing conclusive comes out of it, you can still probe the processes with the usual debugging tools. In that case, depool the server for good measure
$ sudo -i depool
IMPORTANT: remember to also repool it afterwards.
If this happens once, and on just one server, we suggest simply restarting php-fpm
$ sudo -i /usr/local/sbin/restart-php7.2-fpm
and watching the logs/Icinga to see whether the issue has gone away. If the issue is on more than one server, escalate to the SRE team responsible for the service.
Service Ops
Adding a new server into production
- Create a DNS patch to assign IP addresses to the new servers. This is usually done by dcops nowadays, but they might want your review for it. (example change)
- Create a puppet patch that adds the servers with the right regexes in site.pp. Apply the spare::system puppet role. (example change)
- Create mcrouter certs and merge them in the private puppet repo on the puppetmaster (as of today, puppetmaster1001).
- Create a patch to add fake certs in the labs/private repo and merge it. In the labs/private repo you also have to add the V+2 yourself; there is no Jenkins. (example change)
- Decide which role this server should have (appserver, API appserver, jobrunner,..). Use Netbox to search for the host and see which rack it is in. Try to balance server roles across both racks and rows.
- Create a puppet patch that adds the proper role to the servers and adds them in conftool-data in the right section. Don't merge it yet. (example change)
- Disable puppet on Icinga to avoid Icinga alert spam. ex: [icinga1001:~] $ sudo puppet agent --disable <reason/ticket ID>
- Schedule Icinga downtimes for your new hosts for 1h. ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet
- Merge the patch to add puppet roles to the new servers.
- Force a puppet run via cumin. Some errors are normal in the first puppet run. ex: dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q'
- Force a second puppet run via cumin. It should complete successfully.
- Re-enable puppet on icinga: [icinga1001:~] $ sudo puppet agent --enable
- Run downtime with force-puppet-run via cumin ex: dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet
- Watch all (new) Icinga alerts on the hosts turn green, make sure Apache does not have to be restarted. You can "reschedule next service check" in the Icinga web UI to speed things up. It is expected that the "not in dsh group" alert stays CRIT until the server is pooled below. Once all alerts besides that one are green (not PENDING and not CRIT) it is ok to go ahead.
- Check for ongoing deployments and wait if there are any. You can use "jouncebot: now" on IRC and/or the Deployments page on Wikitech.
- Run "scap pull" on new servers to ensure latest MediaWiki version deployed is present.
- Give the server a weight with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1355.eqiad.wmnet set/weight=30
- Pool the server with confctl: ex: [cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet set/pooled=yes
- Watch Grafana Host Overview, select server and see it is getting traffic.
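At any point you can double-check the conftool state of a host with a read-only query (hostname illustrative; the get action is assumed to be available in your confctl version):
[cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet get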
Spreading application servers out across rows and racks
We are aiming to spread out application server roles (regular appserver, API appserver, etc) across both rows (ex. B) as well as racks (ex. B3) in each of the main data centers (currently eqiad and codfw).
Our new pattern to achieve this is to alternate between appserver and API appserver in each row, where odd numbers represent appservers and even numbers represent API appservers.
example:
mw1385 - appserver - rack A5
mw1386 - API server - rack A5
mw1387 - appserver - rack A5
mw1388 - API server - rack A5
..
In puppet's site.pp this results in a structure with regexes like this:
## DATACENTER: EQIAD
..
# Appservers
# Row A
..
# rack A5
node /^mw13(8[579]|91)\.eqiad\.wmnet$/ {
    role(mediawiki::appserver)
}
...
# rack A5
node /^mw13(8[68]|9[02])\.eqiad\.wmnet$/ {
    role(mediawiki::appserver::api)
}
# Row B
...
## DATACENTER: CODFW
..
# Appservers
# Row A
..
# rack A4
In this example rack A5 is split across the 2 roles and ideally the same pattern should repeat for each rack in each row in each datacenter.
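If you want to sanity-check which hosts a node regex matches before merging, a quick grep against the candidate hostnames works (the hosts and regex here are just the example above):
printf 'mw13%s.eqiad.wmnet\n' 85 86 87 88 91 | grep -E '^mw13(8[579]|91)\.eqiad\.wmnet$'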
Removing old appservers from production (decom)
- Identify the servers you want to decom in Netbox. The procurement ticket linked from there tells you the purchase date, so you can see how old they are.
- Create a Gerrit patch that removes the servers from site.pp and conftool-data (example change), but don't merge it yet.
- Set the servers to 'pooled=no' and watch in Grafana how they stop serving traffic, temperature goes down etc. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw123[2-5].eqiad.wmnet' set/pooled=no
- Use the downtime cookbook to schedule monitoring downtimes for the servers. Give a reason and link to your decom ticket. ex: [cumin1001:~] $ sudo cookbook sre.hosts.downtime -r decom -t T247780 -H 2 mw125[0-3].eqiad.wmnet.
- If everything seems fine, set the servers to 'pooled=inactive' now. ex: [cumin1001:~] $ sudo -i confctl select 'name=mw125[0-3].eqiad.wmnet' set/pooled=inactive
- If you are sure, run the actual decom cookbook now. This step is destructive so you will have to reinstall servers to revert. ex: [cumin1001:~] $ sudo cookbook sre.hosts.decommission mw125[0-3].eqiad.wmnet -t T247780
- Merge your prepared puppet change to remove them from site and conftool-data.
- optional: Run puppet on Icinga and see the servers and services on them disappear from monitoring.
- optional: Confirm in Netbox the state of the servers is "decommissioning" now.
- Create and merge a change in the puppet repo to remove the servers from DHCP config (and check for other occurrences of the hostnames).
- Check if any of the servers was an mcrouter proxy (hieradata/common/mcrouter.yaml) or a scap proxy (hieradata/common/scap/dsh.yaml). Remove if needed. (example change)
- Create and merge a change in the DNS repo to remove the production IPs and mgmt IPs while keeping the asset tag names for the mgmt interfaces.
- Hand over the decom ticket to dcops for physical unracking and the final steps in the server lifecycle.