You are browsing a read-only backup copy of Wikitech. The live site can be found at wikitech.wikimedia.org

RESTBase: Difference between revisions

From Wikitech-static
imported>Eevans
(→‎What to check after a deploy: even more post-deploy checks)
imported>Hnowlan
 
(16 intermediate revisions by 7 users not shown)
Line 1: Line 1:
{{Draft}}
{{Draft}}
{{Fixme|1=This document needs expansion}}


[[mw:RESTBase|RESTBase]] is an API proxy serving the REST API at <code>/api/rest_v1/</code>. It uses Cassandra as a storage backend.
[[mw:RESTBase|RESTBase]] is an API proxy serving the REST API at <code>/api/rest_v1/</code>. It uses Cassandra as a storage backend.


It is currently running on restbase100{1..9}.eqiad.wmnet, and shares the hardware with Cassandra instances.
It is currently running on hosts with the <code>profile::restbase</code> class.


== Deployment and config changes ==
== Deployment and config changes ==


=== Getting the Ansible deploy scripts ===
RESTBase is deployed by [[Scap]].
We are using a set of simple [https://github.com/wikimedia/ansible-deploy.git Ansible deploy scripts] to coordinate rolling deploys and restarts. These are currently not installed on a deploy host (FIXME!), so you need to check them out locally:
 
<code>git clone https://github.com/wikimedia/ansible-deploy.git</code>
 
The scripts assume that you have a working SSH proxy command setup, so that <code>ssh restbase1001.eqiad</code> works. The following ansible commands are assumed to be executed from within the ansible-deploy checkout (so <code>cd ansible-deploy</code>).
 
=== Preparing the deploy repository ===
RESTBase is a service-runner based application. To prepare the software repository for a deploy, follow the [https://github.com/wikimedia/service-template-node/blob/master/doc/deployment.md#update update instructions].
 
=== Deploying to staging ===
Before deploying to production, we generally deploy to the staging cluster (xenon, praseodymium and cerium) first. We deploy via Ansible, which handles the full rolling deploy, including restarts and checks.
 
In your local copy of the [https://github.com/wikimedia/ansible-deploy.git ansible-deploy] tree, edit <code>group_vars/restbase-staging</code>, and set <code>restbase_version</code> to the SHA1 sum of the [[#Preparing_the_deploy_repository|deploy repository]] HEAD, and run <code>ansible-playbook -i staging -e target=restbase roles/restbase/deploy.yml</code>
 
Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding. To limit the deploy to a subset of hosts, use the <code>-l</code> argument, e.g. <code>ansible-playbook -i staging -e target=restbase -l xenon.* roles/restbase/deploy.yml</code>. Regular expressions are also supported, which is especially useful for numbered hosts in production: <code>-l ~restbase100[1-2].*</code>.
 
=== Deploying to production ===
If things went well in staging, then you can proceed to deploy to production.
 
In your local copy of the [https://github.com/wikimedia/ansible-deploy.git ansible-deploy] tree, edit <code>group_vars/restbase-production</code>, and set <code>restbase_version</code> to the SHA1 sum of the [[#Preparing_the_deploy_repository|deploy repository]] HEAD, then <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>
 
Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding. To limit the deploy to a subset of hosts, use the <code>-l</code> argument, e.g. <code>ansible-playbook -i production -e target=restbase -l restbase1001.* roles/restbase/deploy.yml</code>. Regular expressions are also supported, which is especially useful for numbered hosts in production: <code>-l ~restbase100[1-2].*</code>.
 
=== Rolling back a deploy ===
Modify <code>restbase_version</code> in <code>group_vars/restbase-{production,staging}</code> from the revision deployed, to the revision you'd like to roll back to. Then, deploy as usual: <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>
 
=== Performing a rolling restart ===
From an ansible tree: <code>ansible-playbook -i production -e target=restbase roles/restbase/restart.yml</code>
 
''Substitute the value given for <code>-i</code> as necessary.''
 
=== Doing Dry Runs ===
Each of the <code>ansible-playbook</code> commands above can be invoked with the <code>--check</code> and <code>--diff</code> flags to get an indication of what the effect will be, without actually making any changes.
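As a minimal sketch of the dry-run form, the helper below only prints the command line it would run, appending <code>--check --diff</code> to the deploy and restart commands shown in this section. The function name <code>dry_run_cmd</code> is hypothetical and not part of ansible-deploy; the inventory names and playbook paths are the ones used above.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a dry-run ansible-playbook command.
# $1 = inventory (staging or production), $2 = playbook path.
dry_run_cmd() {
    inventory=$1
    playbook=$2
    printf 'ansible-playbook -i %s -e target=restbase %s --check --diff\n' \
        "$inventory" "$playbook"
}

# Example: show the dry-run form of a staging deploy.
dry_run_cmd staging roles/restbase/deploy.yml
```

Printing the command first makes it easy to review the inventory and playbook before pasting it into a shell inside the ansible-deploy checkout.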
 
=== Deploy configuration changes ===
As config changes can trigger database changes in RESTBase, it is [[Incident_documentation/20150519-RESTBase|very important]] that those are deployed in a rolling fashion as well. The configuration templating is handled by Puppet, which doesn't directly support rolling deploys. To work around this, we manually perform a rolling deploy by disabling Puppet on all hosts and then re-enabling it one host at a time. Procedure (note: all of the following commands need to be run as root):
 
* Disable puppet on all restbase* hosts, to make sure that config changes are applied one host at a time: <code>puppet agent --disable</code>
* For each node:
** re-enable / run puppet: <code>puppet agent --enable; puppet agent -tv</code>
** restart restbase with <code>systemctl restart restbase</code>
** verify that RB is back up with <code>curl http://<boxip>:7231/</code>
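The procedure above could be laid out as a reviewable plan before anything is run. The sketch below is hypothetical (the function name <code>print_rolling_plan</code> is an illustration, not an existing tool) and executes nothing: it prints the commands from the list in order, one host at a time. It assumes the host name can stand in for <code><boxip></code> in the verification URL.

```shell
#!/bin/sh
# Hypothetical sketch: print the rolling config-deploy plan for the given
# hosts. Nothing is executed; the output is meant to be reviewed and then
# run by hand as root on each host.
print_rolling_plan() {
    # Step 1: disable puppet everywhere so changes land one host at a time.
    for h in "$@"; do
        printf '%s: puppet agent --disable\n' "$h"
    done
    # Step 2: per host, re-enable and run puppet, restart, then verify.
    for h in "$@"; do
        printf '%s: puppet agent --enable; puppet agent -tv\n' "$h"
        printf '%s: systemctl restart restbase\n' "$h"
        # Assumption: the host name is used in place of <boxip>.
        printf '%s: curl http://%s:7231/\n' "$h" "$h"
    done
}

# Example with two of the eqiad hosts named earlier on this page.
print_rolling_plan restbase1001.eqiad.wmnet restbase1002.eqiad.wmnet
```

Keeping the disable step in a separate first pass mirrors the list: no host picks up the new config until its turn comes.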
 
TODO: Integrate with safe rolling restarts above


=== What to check after a deploy ===
=== What to check after a deploy ===
Deploys do not always go according to plan, and regressions are not always obvious.  Here is a list of things you should check after each deploy:
Deploys do not always go according to plan, and regressions are not always obvious.  Here is a list of things you should check after each deploy:
* Does the [http://en.wikipedia.org/api/rest_v1/?doc API documentation] still load?  Consider exercising some of the endpoints from the UI (perhaps by [http://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Foobar requesting an html render]).
* Does the [http://en.wikipedia.org/api/rest_v1/?doc API documentation] still load?  Consider exercising some of the endpoints from the UI (perhaps by [https://en.wikipedia.org/api/rest_v1/page/html/Foobar requesting an html render]).
* Check error logs in [https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase logstash].
* Check error logs in [https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase logstash].
* Have a look at the metrics in [http://grafana.wikimedia.org Grafana].  Have [http://grafana.wikimedia.org/#/dashboard/db/restbase latencies increased, or error rates jumped]?  Is [http://grafana.wikimedia.org/#/dashboard/db/restbase?panelId=4&fullscreen memory utilization] consistent with expectations?  What about storage ([http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad op rates], [http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-storage exceptions], etc)?
* Have a look at the metrics in [http://grafana.wikimedia.org Grafana].  Have [https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&panelId=16&fullscreen&from=now-1h&to=now latencies increased], or [https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&panelId=18&fullscreen&from=now-1h&to=now error rates jumped]?  Is [https://grafana.wikimedia.org/d/000000068/restbase?orgId=1&panelId=4&fullscreen&from=now-1h&to=now memory utilization] consistent with expectations?  What about storage ([https://grafana.wikimedia.org/d/000000418/cassandra?orgId=1&var-datasource=eqiad%20prometheus%2Fservices&var-cluster=restbase&var-keyspace=commons_T_page__summary&var-table=data&var-quantile=99p op rates], exceptions, etc)?
* Consider making an edit to a page using Visual Editor.
* Consider making an edit to a page using Visual Editor.
* Take a look at some recent Visual Editor-performed changes ([https://fr.wikipedia.org/w/index.php?title=Spécial:Modifications_récentes&namespace=&tagfilter=visualeditor French Wikipedia works great for this], as they use VE by default).  Do the diffs look reasonable?
* Take a look at some recent Visual Editor-performed changes ([https://fr.wikipedia.org/w/index.php?title=Spécial:Modifications_récentes&namespace=&tagfilter=visualeditor French Wikipedia works great for this], as they use VE by default).  Do the diffs look reasonable?
Line 64: Line 21:
=== Other considerations ===
=== Other considerations ===
Be sure to log all actions ahead of time in <code>#wikimedia-operations</code>.  Don't be shy about including details.
Be sure to log all actions ahead of time in <code>#wikimedia-operations</code>.  Don't be shy about including details.
== Administration ==
=== Adding a new RESTBase host ===
Before following these instructions, ensure you follow the [[Cassandra#Add a new host to a multi-instance cluster|provisioning documentation for a new Cassandra node.]]
* Add [[phab:rGRBD4ad65b00720f2f8926a0bd2c45c71988deb02266|hosts to the deployment list]] in the Restbase deploy repo
* If there have been changes to the restbase service since you applied the correct roles to the host  (the latest deployed version should be pulled via Puppet during the first puppet runs), deploy restbase to the hosts: from deployment.eqiad.wmnet, <code>cd /srv/deployment/restbase/deploy/</code>, <code>git pull</code> and then <code>scap deploy -f -l restbaseNNNN.DC.wmnet "First deploy to restbaseNNNN"</code>
* [[gerrit:c/operations/puppet/+/632497|Add the hosts to conftool-data]]
* If the hosts are healthy in Icinga at this point and if you feel it is safe as regards deployment timing and so on, pool the hosts: 
** <code>sudo confctl select name=restbaseNNNN.DC.wmnet  set/pooled=yes:weight=10</code>
* Verify that the hosts have been added and are healthy via [https://config-master.wikimedia.org/pybal/codfw/restbase-backend the pybal API]
=== Renewing expired certificates ===
Every now and again Cassandra certificates will come close to expiry (for example: SSL WARNING - Certificate restbase2016-a valid until 2020-11-29 09:26:14 +0000 (expires in 53 days)). Certificates need to be deleted and recreated in the Puppet secrets directory - See the [[Cassandra#Installing and generating certificates|Cassandra documentation]] for details.
== Monitoring ==
=== instance-data ===
In production, the <code>instance-data</code> path is usually a RAID array. It is used for hints, commitlogs and caches - all vital to the stable operation of the Cassandra instances. Under unusual circumstances (a large rebalancing, an instance behaving erroneously etc) this mount can fill up quickly and space will sometimes be required to back out of this condition. For this reason, we set a lower threshold for disk free on this path than for other disks.


== Debugging ==
== Debugging ==
Line 79: Line 55:


The <code>-n 0</code> parameter avoids forking off any workers, which reduces log noise. Instead, a single worker is started up right in the master process.
The <code>-n 0</code> parameter avoids forking off any workers, which reduces log noise. Instead, a single worker is started up right in the master process.
== Analytics and metrics ==
* [https://grafana.wikimedia.org/dashboard/db/restbase?orgId=1 RESTBase grafana dashboard]
* [https://grafana.wikimedia.org/dashboard/db/api-summary?orgId=1 API summary dashboard]
[[Analytics/Systems/Cluster/Hive/Queries|Hive query]] for action API & rest API traffic:<syntaxhighlight lang="sql">
use wmf;
SELECT
  SUM(IF (uri_path LIKE '/api/rest_v1/%', 1, 0)) as count_rest,
  SUM(IF (uri_path LIKE '/w/api.php%', 1, 0)) as count_action
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2017
  AND month = 9
  AND (uri_path LIKE '/api/rest_v1/%' OR uri_path LIKE '/w/api.php%');
</syntaxhighlight>
[[Category:Services]]

Latest revision as of 17:10, 14 December 2021

RESTBase is an API proxy serving the REST API at /api/rest_v1/. It uses Cassandra as a storage backend.

It is currently running on hosts with the profile::restbase class.

Deployment and config changes

RESTBase is deployed by Scap.

What to check after a deploy

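Deploys do not always go according to plan, and regressions are not always obvious. Here is a list of things you should check after each deploy:

  • Does the API documentation still load? Consider exercising some of the endpoints from the UI (perhaps by requesting an HTML render).
  • Check error logs in logstash.
  • Have a look at the metrics in Grafana. Have latencies increased, or error rates jumped? Is memory utilization consistent with expectations? What about storage (op rates, exceptions, etc)?
  • Consider making an edit to a page using Visual Editor.
  • Take a look at some recent Visual Editor-performed changes (French Wikipedia works great for this, as they use VE by default). Do the diffs look reasonable?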

Other considerations

Be sure to log all actions ahead of time in #wikimedia-operations. Don't be shy about including details.

Administration

Adding a new RESTBase host

Before following these instructions, ensure you follow the provisioning documentation for a new Cassandra node.

  • Add hosts to the deployment list in the Restbase deploy repo
  • If there have been changes to the restbase service since you applied the correct roles to the host (the latest deployed version should be pulled via Puppet during the first puppet runs), deploy restbase to the hosts: from deployment.eqiad.wmnet, cd /srv/deployment/restbase/deploy/, git pull and then scap deploy -f -l restbaseNNNN.DC.wmnet "First deploy to restbaseNNNN"
  • Add the hosts to conftool-data
  • If the hosts are healthy in Icinga at this point and if you feel it is safe as regards deployment timing and so on, pool the hosts:
    • sudo confctl select name=restbaseNNNN.DC.wmnet  set/pooled=yes:weight=10
  • Verify that the hosts have been added and are healthy via the pybal API
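
The final verification step above could be scripted roughly as follows. This is a sketch under an assumption the page does not confirm: that the pybal output lists one host per line with an <code>'enabled': True</code> marker. The function name <code>host_enabled</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical check: succeed if host $1 appears on stdin in a line that is
# marked enabled. Assumes one host per line and an "'enabled': True" marker;
# adjust the patterns if the real pybal output differs.
host_enabled() {
    grep -F "'$1'" | grep -q "'enabled': True"
}
```

A possible use, with the pybal URL taken from the list above: curl -s https://config-master.wikimedia.org/pybal/codfw/restbase-backend | host_enabled restbaseNNNN.DC.wmnet && echo pooled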

Renewing expired certificates

Every now and again Cassandra certificates will come close to expiry (for example: SSL WARNING - Certificate restbase2016-a valid until 2020-11-29 09:26:14 +0000 (expires in 53 days)). Certificates need to be deleted and recreated in the Puppet secrets directory - See the Cassandra documentation for details.

Monitoring

instance-data

In production, the instance-data path is usually a RAID array. It is used for hints, commitlogs and caches - all vital to the stable operation of the Cassandra instances. Under unusual circumstances (a large rebalancing, an instance behaving erroneously etc) this mount can fill up quickly and space will sometimes be required to back out of this condition. For this reason, we set a lower threshold for disk free on this path than for other disks.
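
For a quick manual look at how full that mount is, something like the sketch below may help. The function names are hypothetical, and both the mount path and the threshold are left as arguments because this page gives neither; the real alert thresholds live in the monitoring configuration.

```shell
#!/bin/sh
# Print the used percentage (integer, without the % sign) of the filesystem
# that holds path $1, using the portable df -P output format.
used_pct() {
    df -P "$1" | awk 'NR == 2 { sub(/%/, "", $5); print $5 }'
}

# Succeed if usage of path $1 is at or above $2 percent.
# Both values are caller-supplied; none are taken from production config.
over_threshold() {
    [ "$(used_pct "$1")" -ge "$2" ]
}

# Example: report usage of the root filesystem.
used_pct /
```

Replace / with the actual instance-data mount point when running this on a RESTBase host.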

Debugging

To temporarily switch to local logging for debugging, you can change the config.yaml log stanza like this:

logging:
  name: restbase
  streams:
    # level can be trace, debug, info, warn, error
    - level: info 
      path: /tmp/debug.log

Alternatively, you can log to stdout by commenting out the streams sub-object. This is useful for debugging startup failures like this:

cd /srv/deployment/restbase/deploy/
sudo -u restbase node restbase/server.js -c /etc/restbase/config.yaml -n 0

The -n 0 parameter avoids forking off any workers, which reduces log noise. Instead, a single worker is started up right in the master process.
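
When reading the resulting log, a filter along these lines can cut it down to the interesting entries. This sketch assumes the log is written as JSON lines with bunyan-style numeric levels (40 = warn, 50 = error, 60 = fatal), which this page does not state; the function name <code>warn_or_worse</code> is hypothetical.

```shell
#!/bin/sh
# Hypothetical filter: print only warn-or-worse entries from a JSON-lines
# log on stdin. Assumes bunyan-style numeric levels written without spaces,
# e.g. "level":50 (40 = warn, 50 = error, 60 = fatal).
warn_or_worse() {
    grep -E '"level":(40|50|60)[,}]'
}
```

For example, with the log path from the config stanza above: warn_or_worse < /tmp/debug.log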

Analytics and metrics

Hive query for action API & rest API traffic:

use wmf;

SELECT
  SUM(IF (uri_path LIKE '/api/rest_v1/%', 1, 0)) as count_rest,
  SUM(IF (uri_path LIKE '/w/api.php%', 1, 0)) as count_action
FROM wmf.webrequest
WHERE webrequest_source = 'text'
  AND year = 2017
  AND month = 9
  AND (uri_path LIKE '/api/rest_v1/%' OR uri_path LIKE '/w/api.php%');