You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

Application servers: Difference between revisions

From Wikitech-static
imported>Dzahn
imported>Krinkle
mNo edit summary
{{Navigation Wikimedia infrastructure|expand=mw}}
{{See|See also '''[[Application servers/Runbook]]''' for how to perform common tasks, or diagnose issues.}}
The '''Application servers''' (or '''app servers''') are the several hundred Apache servers that run the [[MediaWiki]] backend software (written in PHP).


The Apache configurations are maintained in the Puppet repository, at [https://github.com/wikimedia/operations-puppet/tree/production/modules/mediawiki/files/apache/sites operations/puppet.git:/modules/mediawiki/files/apache/sites/]. Prior to 2012, these were in Subversion.
:''See also '''[[Application servers/Runbook]]''' which covers php-fpm, opcache and more.'' (TODO: Should these pages be merged?)
{{TOC|align=right}}
==Testing config==
* Submit change to [[Gerrit]] in the <code>modules/mediawiki/files/apache/sites</code> directory (project: operations/puppet)
*Choose [[X-Wikimedia-Debug#Available_backends|one of the mediawiki debug servers]]. Then, on that server:
**Disable puppet: <code>sudo puppet agent --disable 'insert reason'</code>
** Apply change locally under <code>/etc/apache2/sites-enabled/</code>
**<code>sudo apache2ctl restart</code>
* Test your change by making a relevant HTTP request. See [[Debugging in production#Debugging a web request|Debugging in production]] for how.
* When you're done, <code>mwdebug####$ sudo puppet agent --enable</code>
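The steps above can be sketched as a small shell script. This is only a sketch under stated assumptions, not an official tool: the <code>run</code> helper, the <code>DRY_RUN</code> variable and the three function names are hypothetical, and the <code>apache2ctl configtest</code> syntax check is an extra precaution that the list above does not mention. With the default <code>DRY_RUN=1</code> the commands are only printed, never executed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1 (the default), run it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (hypothetical name): stop puppet from reverting local edits.
# $1 is the reason recorded by puppet.
begin_apache_test() {
    run sudo puppet agent --disable "$1"
}

# Step 2 (hypothetical name): after editing the file under
# /etc/apache2/sites-enabled/ by hand, check the syntax and restart Apache.
# The configtest call is an assumption added as a precaution.
apply_apache_test() {
    run sudo apache2ctl configtest && run sudo apache2ctl restart
}

# Step 3 (hypothetical name): hand control back to puppet when done.
finish_apache_test() {
    run sudo puppet agent --enable
}
```

On an mwdebug host one would call <code>begin_apache_test 'insert reason'</code>, edit the configuration, call <code>apply_apache_test</code>, test with HTTP requests, and finally call <code>finish_apache_test</code>. Setting <code>DRY_RUN=0</code> makes the functions execute the commands instead of printing them.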
==Deploying config==
Schedule any configuration update on the [[Deployments]] page. A bad configuration going live can easily result in a site outage.
* Test your change in deployment-prep and make sure that it works as expected.


* Submit change to [[gerrit]] in the <code>modules/mediawiki/files/apache/sites</code> directory (project: operations/puppet)
* Disable puppet across the affected mediawiki application servers.
** Cumin can help find the precise set of hosts. For example, this is a recent query: <syntaxhighlight lang="bash">
cumin 'R:File = "/etc/apache2/sites-available/04-remnant.conf"' 'disable-puppet "elukey - precaution for https://gerrit.wikimedia.org/r/#/c/380774/"' -b 10
</syntaxhighlight> In this case the change was related to a RewriteRule change in '''04-remnant.conf''', but the query must be adapted each time to match the file(s) modified by the Gerrit change.
* Merge via gerrit and run on puppetmaster1001 the usual <code>puppet-merge</code>
* Create a plain text file with some significant URLs that should be affected by the Gerrit change. Some example files with test URLs are deployed in <code>/usr/local/share/apache-tests/</code> on the deployment hosts. This text file will be used later on deploy1001 with '''apache-fast-test''' to verify that the change works as expected.
** For example, if you are adding or modifying a RewriteRule, please add to your text file some URLs that are expected to change.
* Go to one of the '''mwdebug''' servers and enable/run puppet. Apache will reload its configuration automatically; check that no error messages are emitted. Running '''apachectl -t''' after running puppet helps verify that the new configuration is syntactically correct (this does not imply that it will work as intended).
** Some Apache directive changes need a full restart to get applied, not a simple reload. These changes are very rare and they are clearly indicated in Apache's documentation, so please verify it beforehand. Simple RewriteRule changes require only an Apache reload.
* On deploy1001 run '''apache-fast-test''' against the selected '''mwdebug''' host using <code>/usr/local/share/apache-tests/baseurls</code> '''and your new test file'''. '''Both of them need to return a positive confirmation that everything looks good.'''
** Example of usage related to the previously mentioned change (https://gerrit.wikimedia.org/r/#/c/380774):
<syntaxhighlight lang="bash">
elukey@deploy1001:~$ apache-fast-test /usr/local/share/apache-tests/baseurls mwdebug1001.eqiad.wmnet
testing 19 urls on 1 servers, totalling 19 requests
spawning threads..

http://elefante-a-pallini.ro.sa/
* 200 OK 929
http://wikimedia.org/research
* 301 Moved Permanently https://wikimedia.qualtrics.com/SE/?SID=SV_6R04ammTX8uoJFP
http://www.wikipedia.org/wiki/it:Francesco_Totti
* 302 Found http://it.wikipedia.org/wiki/Francesco_Totti
http://zero.wikipedia.org/
* 302 Found http://en.zero.wikipedia.org/wiki/Special:ZeroRatedMobileAccess
[.. cut ..]
* 301 Moved Permanently https://meta.wikimedia.org/wiki/Special:UrlShortener

elukey@deploy1001:~$ apache-fast-test wikidata_redirect mwdebug1001.eqiad.wmnet
testing 1 urls on 1 servers, totalling 1 requests
spawning threads..

https://commons.wikimedia.org/data/main/Data:Bundestagswahl2017/wahlkreis46.map
* 301 Moved Permanently https://commons.wikimedia.org/wiki/Special:PageData/main/Data:Bundestagswahl2017/wahlkreis46.map
</syntaxhighlight>
* Enable/Run puppet on another mediawiki application server that is taking traffic, de-pooling it beforehand via confctl. Verify again from deploy1001 that everything is working as expected, running apache-fast-test.
* Repool the host mentioned above and verify on Apache access logs that everything looks fine. If you want to be extra paranoid, you can check the host level metrics via https://grafana.wikimedia.org/d/000000327/apache-hhvm?orgId=1 and make sure that nothing is out of the ordinary.
* Re-enable puppet across the appservers previously disabled via cumin.
* Keep an eye on the operations channel and make sure that puppet runs fine on these hosts.

==Service==
Puppet roles:
* <code>mediawiki::appserver</code>, <code>mediawiki::canary_appserver</code>
* <code>mediawiki::appserver::api</code>, <code>mediawiki::appserver::canary_api</code>
* <code>mediawiki::maintenance</code>
* <code>mediawiki::jobrunner</code>

Relevant puppet classes:
* <code>[https://gerrit.wikimedia.org/g/operations/puppet/+/HEAD/modules/profile/manifests/mediawiki/webserver.pp profile::mediawiki::webserver]</code>, this provisions Apache, and any other packages or resources needed by MediaWiki on app servers.
** <code>[https://gerrit.wikimedia.org/g/operations/puppet/+/HEAD/modules/profile/manifests/mediawiki/httpd.pp profile::mediawiki::httpd]</code>, the Apache service.
** <code>[https://gerrit.wikimedia.org/r/plugins/gitiles/operations/puppet/+/HEAD/modules/mediawiki/manifests/web/prod_sites.pp mediawiki::web::prod_sites]</code>, the Apache configuration for all production websites (including wikipedia.org).
** Additional Apache configurations are at [https://github.com/wikimedia/operations-puppet/tree/production/modules/mediawiki/files/apache/sites modules/mediawiki/files/apache/sites/]. Prior to 2012, Apache configurations were in a Subversion repository.

==Architecture==
{{See|See [[MediaWiki at WMF#Infrastructure|MediaWiki at WMF § Infrastructure]] for the CDN and traffic layers outside app servers. <br> See also [[MediaWiki at WMF#MediaWiki_configuration|MediaWiki configuration]].}}

The application servers are load-balanced via [[LVS]]. Connections between our CDN (HTTP cache proxies) and app servers are encrypted with TLS, which is terminated locally on the app server using a simple '''Nginx-''' install. Nginx then hands the request off to the local Apache.

'''Apache''' there is in charge of handling redirects, rewrite rules, and determining the [[MediaWiki at WMF#Document root|document root]]. It then uses <code>php-fpm</code> to invoke the MediaWiki software.

The Apache [https://httpd.apache.org/docs/2.4/mpm.html MPM] we use is <code>[https://httpd.apache.org/docs/2.4/mod/worker.html mod_worker]</code>, which decides how <code>php-fpm</code> processes are spawned.


==Logging==
Host: en.wiktionary.org
User-agent: testthing
</pre>
== Hardware Repair ==
==== Application Servers ====
When taking down application servers (running MediaWiki) for things like disk replacement or other hardware repair, ''do not forget to'':
* before: remove from dsh group
** These are in puppet, operations/puppet repo, in modules/dsh/files/group. The important one for MediaWiki sync is "mediawiki-installation".
** TODO: Document what to do if it's a scap proxy (see hieradata/common/dsh/config.yaml)
* before: de-pool in pybal
** See [[pybal]]. You can just grep for the server name, set 'enabled': False, and save.
* before: check nobody is scapping right now (best: announce with a !log line in IRC)
** This is an IRC thing on freenode in #wikimedia-dev/-tech/-operations
* during: acknowledge Icinga monitoring checks (best: with related ticket number as comment)
** Do this by logging in via browser on icinga.wikimedia.org. Search for the hostname, check all services and use the "acknowledge" option. You'll see the IRC bots outputting this as well, and they will stop repeating things over and over in the channels.
* after: re-add to dsh groups
** Revert the above.
* after: re-pool in pybal
** Revert the above.
== Adding a new server into production ==
* Create DNS patch to assign IP addresses to them. This is usually done by dcops nowadays but they might want your review for it. ([https://gerrit.wikimedia.org/r/c/operations/dns/+/571785 example change])
* Create a puppet patch that adds the servers with the right regexes in site.pp. Apply the spare::system puppet role. ([https://gerrit.wikimedia.org/r/c/operations/puppet/+/572975 example change])
* [[Mcrouter#Generate_certs_for_a_new_host|Create mcrouter certs]], merge them in the private puppet repo on the puppetmaster (as of today [[Puppetmaster1001]]).
* Create a patch to add fake certs in the '''labs/private''' repo. Merge it. In the labs/private repo you have to also add the V+2 yourself, no jenkins. ([https://gerrit.wikimedia.org/r/c/labs/private/+/573002 example change])
* Decide which role this server should have (appserver, API appserver, jobrunner,..). Use [https://netbox.wikimedia.org/ Netbox] to search for the host and see which rack it is in. Try to balance server roles across both racks and rows.
* Create a puppet patch that adds the proper role to the servers and adds them in conftool-data in the right section. Don't merge it yet. ([https://gerrit.wikimedia.org/r/c/operations/puppet/+/573019 example change])
* Disable puppet on Icinga to [[Icinga#Avoid_Icinga_spam_on_new_server_installs|avoid Icinga alert spam]]. ex: '''[icinga1001:~] $ sudo puppet agent --disable <reason/ticket ID>'''
* Schedule Icinga downtimes for your new hosts for 1h. ex: '''dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 mw13[63,74-83].eqiad.wmnet'''
* Merge the patch to add puppet roles to the new servers.
* Force a puppet run via cumin. Some errors are normal in the first puppet run. ex: '''dzahn@cumin1001:~$ sudo -i cumin -b 15 'mw13[63,74-83].eqiad.wmnet' 'run-puppet-agent -q''''
* Force a second puppet run via cumin. It should complete successfully.
* Re-enable puppet on icinga: '''[icinga1001:~] $ sudo puppet agent --enable'''
* Run downtime with force-puppet-run via cumin ex: '''dzahn@cumin1001:~$ sudo cookbook sre.hosts.downtime -r new_install -t T236437 -H 1 --force-puppet mw13[63,74-83].eqiad.wmnet'''
* Watch all (new) [https://icinga.wikimedia.org/icinga/ Icinga] alerts on the hosts turn green, make sure Apache does not have to be restarted. You can "reschedule next service check" in the Icinga web UI to speed things up. It is expected that the "not in dsh group" alert stays CRIT until the server is pooled below. Once all alerts besides that one are green (not PENDING and not CRIT) it is ok to go ahead.
* Check for ongoing deployments. Wait if that is the case. You can use "jouncebot: now" on IRC and/or the [https://wikitech.wikimedia.org/wiki/Deployments Deployment page] on Wikitech wiki.
* Run "scap pull" on the new servers to ensure that the latest deployed MediaWiki version is present.
* Give the server a weight with confctl: ex: '''[cumin1001:~] $ sudo -i confctl select name=mw1355.eqiad.wmnet set/weight=30'''
* Pool the server with confctl: ex: '''[cumin1001:~] $ sudo -i confctl select name=mw1353.eqiad.wmnet set/pooled=yes'''
* Watch [https://grafana.wikimedia.org/d/000000377/host-overview?orgId=1&refresh=5m Grafana Host Overview], select server and see it is getting traffic.
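The last two confctl steps of the list above can be sketched as a small shell function. This is a sketch under stated assumptions, not an existing tool: <code>run</code>, <code>DRY_RUN</code> and <code>pool_new_host</code> are hypothetical names, and the default weight of 30 is simply taken from the example above. With the default <code>DRY_RUN=1</code> the confctl commands are only printed.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the command when DRY_RUN=1 (the default), run it otherwise.
DRY_RUN="${DRY_RUN:-1}"
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Hypothetical wrapper: give a new host a weight, then pool it.
# $1 = fully qualified host name, $2 = weight (assumed default: 30, as in the example).
pool_new_host() {
    local host="$1" weight="${2:-30}"
    run sudo -i confctl select "name=${host}" "set/weight=${weight}"
    run sudo -i confctl select "name=${host}" set/pooled=yes
}

# Dry-run demonstration with the host name used in the list above.
pool_new_host mw1355.eqiad.wmnet 30
```

Setting the weight before pooling mirrors the order of the list. The function should only be run for real (<code>DRY_RUN=0</code>) once the Icinga checks are green and no deployment is in progress.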
== Spreading application servers out across rows and racks ==
We are aiming to spread out application server roles (regular appserver, API appserver, etc) across both rows (ex. B) as well as racks (ex. B3) in each of the main data centers (currently eqiad and codfw).
Our new pattern to achieve this is alternating between appserver and API appserver in each row where odd numbers represent appservers and even numbers represent API appservers.
example:
<pre>
mw1385 - appserver  - rack A5
mw1386 - API server - rack A5
mw1387 - appserver  - rack A5
mw1388 - API server - rack A5
..
</pre>
In puppet's site.pp this results in a structure with regexes like this:
<pre>
## DATACENTER: EQIAD
..
# Appservers
# Row A
..
# rack A5
node /^mw13(8[579]|91)\.eqiad\.wmnet$/ {
    role(mediawiki::appserver)
}
...
# rack A5
node /^mw13(8[68]|9[02])\.eqiad\.wmnet$/ {
    role(mediawiki::appserver::api)
}
# Row B
...
## DATACENTER: CODFW
..
# Appservers
# Row A
..
# rack A4
</pre>
In this example rack A5 is split across the 2 roles and ideally the same pattern should repeat for each rack in each row in each datacenter.
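Before merging a site.pp change it can help to check which hosts each regex actually matches. The following is a minimal sketch, assuming the two rack A5 regexes from the excerpt above; <code>role_for_host</code> is a hypothetical helper, and bash's extended regular expressions are used as a stand-in for Puppet's regex engine, which behaves the same for these simple patterns.

```bash
#!/usr/bin/env bash
# Regexes copied from the site.pp excerpt above (eqiad, rack A5).
APPSERVER_RE='^mw13(8[579]|91)\.eqiad\.wmnet$'
API_RE='^mw13(8[68]|9[02])\.eqiad\.wmnet$'

# Hypothetical helper: print the role a host name would get from the two node blocks.
role_for_host() {
    local host="$1"
    if [[ "$host" =~ $APPSERVER_RE ]]; then
        echo "mediawiki::appserver"
    elif [[ "$host" =~ $API_RE ]]; then
        echo "mediawiki::appserver::api"
    else
        echo "unmatched"
    fi
}

# Print the role for each host in the example listing.
for n in 1385 1386 1387 1388; do
    host="mw${n}.eqiad.wmnet"
    echo "${host} -> $(role_for_host "$host")"
done
```

Odd-numbered hosts come out as <code>mediawiki::appserver</code> and even-numbered ones as <code>mediawiki::appserver::api</code>, which is the alternating pattern described above. A host that prints <code>unmatched</code> is not covered by either node block.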


== See also ==
* [[Application servers/Runbook#DC Ops]]
* [[Apache log format]]
* [[UID]]


[[Category:Servers by usage| Apache]]
[[Category:MediaWiki production| ]]

Revision as of 20:30, 17 March 2020

The Application servers (or app servers) are the several hundred Apache servers that run the MediaWiki backend software (written in PHP).

Service

Puppet roles:

  • mediawiki::appserver, mediawiki::canary_appserver
  • mediawiki::appserver::api, mediawiki::appserver::canary_api
  • mediawiki::maintenance
  • mediawiki::jobrunner

Relevant puppet classes:

Architecture

The application servers are load-balanced via LVS. Connections between our CDN (HTTP cache proxies) and app servers are encrypted with TLS, which is terminated locally on the app server using a simple Nginx- install. Nginx then hands the request off to the local Apache.

Apache there is in charge of handling redirects, rewrite rules, and determining the document root. It then uses php-fpm to invoke the MediaWiki software.

The Apache MPM we use is mod_worker, which decides how php-fpm processes are spawned.

Logging

Apache errors are logged to /srv/mw-log/apache2.log on mwlog1001.

Apache access logs are mostly disabled. Statistics are drawn from Varnish front ends instead.

Apache setup checklist

  • apt-get update && apt-get dist-upgrade -y && apt-get install wikimedia-task-appserver && reboot && exit
  • Wait for the server to come back online, ensure it starts apache correctly
    • echo 'GET /' | nc localhost 80 or any of the number of tests listed below
  • If the server is part of the memcached group, follow instructions on Memcached
  • If the server is new, you will need to do the following:
  • Login to the LVS server for apaches (lvs3 as of 2009-02-13) and add the new servers to /etc/pybal/apaches
  • If the server is not new do the following:
  • Ensure the server is now enabled in pybal on the LVS server in the file /etc/pybal/apaches
  • You will need to add the server to DSH groups if new, or check if they are commented, if the server is not new:
  • Add/Uncomment the host to /usr/local/dsh/node_groups/apaches and mediawiki-installation, as well as any other groups needed
  • Reload nagios to accept the changes to the node groups:
  • cd /home/wikipedia/conf/nagios && ./sync
  • Verify that the server is taking traffic and doing work
  • ipvsadm -L | grep SERVERNAME
  • traffic logs?

Test cases

Here are some test cases you can use to test the apache configuration after changing something.

GET /wiki/Foo HTTP/1.1
Host: en.wikipedia.org
User-agent: testthing

GET /wiki/Foo HTTP/1.1
Host: www.wikipedia.org
User-agent: testthing

GET /wiki/Main_Page HTTP/1.1
Host: www.wikipedia.com
User-agent: testthing

GET / HTTP/1.1
Host: wikipedia.com
User-agent: testthing

GET / HTTP/1.1
Host: wikibooks.org
User-agent: testthing

GET / HTTP/1.1
Host: wikiquote.org
User-agent: testthing

GET / HTTP/1.1
Host: dk.wikipedia.org
User-agent: testthing

GET / HTTP/1.1
Host: foo.wikipedia.org
User-agent: testthing

GET /wiki/Main_Page HTTP/1.1
Host: test.wikipedia.org
User-agent: testthing

GET /wiki/Foo HTTP/1.1
Host: en.wikipedia.org
User-Agent: Exalead

GET /wiki/Foo HTTP/1.1
Host: meta.wikimedia.org
User-agent: testthing

GET / HTTP/1.1
Host: en.wiktionary.org
User-agent: testthing
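
The test cases above can be generated and sent from a shell. This is a minimal sketch, not an existing tool: build_request is a hypothetical helper, and the "Connection: close" header is an addition (not part of the test cases) so that the server closes the connection after answering.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a raw HTTP/1.1 request.
# $1 = path, $2 = Host header, $3 = User-agent (defaults to "testthing" as in the test cases).
build_request() {
    local path="$1" host="$2" agent="${3:-testthing}"
    # "Connection: close" is an assumption added so the server ends the connection.
    printf 'GET %s HTTP/1.1\r\nHost: %s\r\nUser-agent: %s\r\nConnection: close\r\n\r\n' \
        "$path" "$host" "$agent"
}
```

To send a test case to the local Apache, as in the setup checklist, pipe the output into nc, for example: build_request /wiki/Foo en.wikipedia.org | nc localhost 80. The third argument covers cases such as the Exalead user agent.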

See also