RESTBase: Difference between revisions

imported>Eevans
m (indicate the work-in-progress nature of the page)
imported>Eevans
(→‎What to check after a deploy: even more post-deploy checks)
Line 20: Line 20:
Before deploying to production, we generally deploy to the staging cluster (xenon, praseodymium and cerium) first. We deploy via Ansible, which handles the full rolling deploy, including restarts and checks.


In the ansible tree: <code>ansible-playbook -i staging -e target=restbase roles/restbase/deploy.yml</code>
In your local copy of the [https://github.com/wikimedia/ansible-deploy.git ansible-deploy] tree, edit <code>group_vars/restbase-staging</code>, and set <code>restbase_version</code> to the SHA1 sum of the [[#Preparing_the_deploy_repository|deploy repository]] HEAD, and run <code>ansible-playbook -i staging -e target=restbase roles/restbase/deploy.yml</code>


Tip: You can also limit the deploy to some hosts only: <code>ansible-playbook -i staging -e target=restbase -l xenon.* roles/restbase/deploy.yml</code>. Regexps are also supported, which is especially useful for numbered hosts in production: <code>-l ~restbase.100[1-2].*</code>
Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding.  To limit the deploy to a subset of hosts, use the <code>-l</code> argument, e.g. <code>ansible-playbook -i staging -e target=restbase -l xenon.* roles/restbase/deploy.yml</code>. Regular expressions are also supported, which is especially useful for numbered hosts in production: <code>-l ~restbase100[1-2].*</code>.


=== Deploying to production ===
If things went well in staging, then you can proceed to deploy to production.


In the ansible tree: <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>
In your local copy of the [https://github.com/wikimedia/ansible-deploy.git ansible-deploy] tree, edit <code>group_vars/restbase-production</code>, and set <code>restbase_version</code> to the SHA1 sum of the [[#Preparing_the_deploy_repository|deploy repository]] HEAD, then run <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>
 
Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding.  To limit the deploy to a subset of hosts, use the <code>-l</code> argument, e.g. <code>ansible-playbook -i production -e target=restbase -l restbase1001.* roles/restbase/deploy.yml</code>. Regular expressions are also supported, which is especially useful for numbered hosts in production: <code>-l ~restbase100[1-2].*</code>.


=== Rolling back a deploy ===
Modify the restbase version in <code>group_vars/restbase</code> from 'master' to the revision you'd like to roll back to. Then, deploy as usual:
Modify <code>restbase_version</code> in <code>group_vars/restbase-{production,staging}</code> from the currently deployed revision to the revision you'd like to roll back to. Then deploy as usual: <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>


In the ansible tree: <code>ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml</code>
=== Performing a rolling restart ===
From an ansible tree: <code>ansible-playbook -i production -e target=restbase roles/restbase/restart.yml</code>


=== Rolling restart ===
''Substitute the value given for <code>-i</code> as necessary.''
In the ansible tree: <code>ansible-playbook -i production -e target=restbase roles/restbase/restart.yml</code>


=== Doing Dry Runs ===
Each of the <code>ansible-playbook</code> commands above can be invoked with the <code>--check</code> and <code>--diff</code> flags to get an indication of what the effect will be, without actually making any changes.


=== Deploy config changes ===
=== Deploy configuration changes ===
Because config changes can trigger database changes in RESTBase, it is [[Incident_documentation/20150519-RESTBase|very important]] that they are also deployed in a rolling fashion. The configuration templating is handled by puppet, which doesn't directly support rolling deploys. To work around this, we perform the rolling deploy manually: disable puppet on all hosts, then re-enable and run it on one host at a time. Procedure (note: all of the following commands need to be run as root):


Line 51: Line 53:
TODO: Integrate with safe rolling restarts above


=== After each deploy ===
=== What to check after a deploy ===
* Verify that it's still working: http://en.wikipedia.org/api/rest_v1/?doc
Deploys do not always go according to plan, and regressions are not always obvious.  Here is a list of things you should check after each deploy:
* Check error logs in https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase
* Does the [http://en.wikipedia.org/api/rest_v1/?doc API documentation] still load?  Consider exercising some of the endpoints from the UI (perhaps by [http://rest.wikimedia.org/en.wikipedia.org/v1/page/html/Foobar requesting an html render]).
* Check error logs in [https://logstash.wikimedia.org/#/dashboard/elasticsearch/restbase logstash].
* Have a look at the metrics in [http://grafana.wikimedia.org Grafana].  Have [http://grafana.wikimedia.org/#/dashboard/db/restbase latencies increased, or error rates jumped]?  Is [http://grafana.wikimedia.org/#/dashboard/db/restbase?panelId=4&fullscreen memory utilization] consistent with expectations?  What about storage ([http://grafana.wikimedia.org/#/dashboard/db/cassandra-restbase-eqiad op rates], [http://grafana.wikimedia.org/#/dashboard/db/restbase-cassandra-storage exceptions], etc)?
* Consider making an edit to a page using Visual Editor.
* Take a look at some recent Visual Editor-performed changes ([https://fr.wikipedia.org/w/index.php?title=Spécial:Modifications_récentes&namespace=&tagfilter=visualeditor French Wikipedia works great for this], as they use VE by default).  Do the diffs look reasonable?
* Keep a close eye on <code>#wikimedia-operations</code>; if someone spots a problem, they're likely to raise the issue there.


=== Deployment checklist (WIP) ===
=== Other considerations ===
# Prepare the deploy repository, and take note of the Git ID of <code>HEAD</code>
Be sure to log all actions ahead of time in <code>#wikimedia-operations</code>. Don't be shy about including details.
# Update <code>group_vars/<cluster>-staging</code> in ansible-deploy; Set <code>restbase_version</code> (using the Git ID from #1)
# Deploy to staging environment, and test thoroughly
# Update <code>group_vars/<cluster>-production</code> in ansible-deploy; Set <code>restbase_version</code> (using the Git ID from #1)
# If possible, deploy first to canary node:
## Log the action in #wikimedia-operations (e.g. <code>!log canary deploy of afafafaf to restbase1001.eqiad.wmnet</code>)
##


== Debugging ==

Revision as of 20:03, 9 October 2015

RESTBase is an API proxy serving the REST API at /api/rest_v1/. It uses Cassandra as a storage backend.

It is currently running on restbase100{1..9}.eqiad.wmnet, and shares the hardware with Cassandra instances.

Deployment and config changes

Getting the Ansible deploy scripts

We are using a set of simple Ansible deploy scripts to coordinate rolling deploys and restarts. These are currently not installed on a deploy host (FIXME!), so you need to check them out locally:

git clone https://github.com/wikimedia/ansible-deploy.git

The scripts assume that you have a working SSH proxy command setup, so that ssh restbase1001.eqiad works. The following ansible commands are assumed to be executed from within the ansible-deploy checkout (so cd ansible-deploy).

Preparing the deploy repository

RESTBase is a service-runner-based application. To prepare the software repository for a deploy, follow the instructions on updating, here.

Deploying to staging

Before deploying to production, we generally deploy to the staging cluster (xenon, praseodymium and cerium) first. We deploy via Ansible, which handles the full rolling deploy, including restarts and checks.

In your local copy of the ansible-deploy tree, edit group_vars/restbase-staging, and set restbase_version to the SHA1 sum of the deploy repository HEAD, and run ansible-playbook -i staging -e target=restbase roles/restbase/deploy.yml
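The version bump described above can be scripted. The following is only a sketch: set_restbase_version is a hypothetical helper, not part of ansible-deploy, and it assumes the group_vars file contains a YAML line of the form "restbase_version: <sha>". The path in the usage comment is a placeholder.

```shell
# Hypothetical helper (not part of ansible-deploy): point restbase_version
# at a new commit. Assumes the group_vars file holds a YAML line of the
# form "restbase_version: <sha>".
set_restbase_version() {
  local file="$1" sha="$2" tmp
  tmp="$(mktemp)" || return 1
  # Rewrite into a temp file first, then overwrite the original in place
  # so that the file keeps its mode and ownership.
  sed "s/^restbase_version:.*/restbase_version: ${sha}/" "${file}" > "${tmp}" &&
    cat "${tmp}" > "${file}"
  local status=$?
  rm -f "${tmp}"
  return "${status}"
}

# Usage (the deploy repository path is a placeholder):
#   set_restbase_version group_vars/restbase-staging \
#     "$(git -C /path/to/deploy rev-parse HEAD)"
```

Reviewing the result with git diff before running the playbook is a cheap way to confirm that only the intended line changed.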

Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding. To limit the deploy to a subset of hosts, use the -l argument, e.g. ansible-playbook -i staging -e target=restbase -l xenon.* roles/restbase/deploy.yml. Regular expressions are also supported, which is especially useful for numbered hosts in production: -l ~restbase100[1-2].*.
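As a rough illustration of the canary pattern, the sketch below only assembles and prints the command lines; it does not run Ansible. deploy_cmd is a hypothetical helper name. The inventory, target and playbook path are the ones used in the text above.

```shell
# Hypothetical helper: print (not run) the deploy command for an inventory,
# optionally limited to a host pattern via -l.
deploy_cmd() {
  local inventory="$1" limit="${2:-}"
  local cmd="ansible-playbook -i ${inventory} -e target=restbase"
  if [ -n "${limit}" ]; then
    # Quote the pattern so the shell never expands * or [1-2] itself.
    cmd="${cmd} -l '${limit}'"
  fi
  printf '%s roles/restbase/deploy.yml\n' "${cmd}"
}

deploy_cmd staging 'xenon.*'   # canary node first
deploy_cmd staging             # then the whole staging cluster
```

Quoting the -l pattern is a defensive choice: unquoted, the shell may try to match xenon.* or ~restbase100[1-2].* against local file names before Ansible sees it.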

Deploying to production

If things went well in staging, then you can proceed to deploy to production.

In your local copy of the ansible-deploy tree, edit group_vars/restbase-production, and set restbase_version to the SHA1 sum of the deploy repository HEAD, then run ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml

Tip: It is common to first deploy to a "canary node", and evaluate the impact there before proceeding. To limit the deploy to a subset of hosts, use the -l argument, e.g. ansible-playbook -i production -e target=restbase -l restbase1001.* roles/restbase/deploy.yml. Regular expressions are also supported, which is especially useful for numbered hosts in production: -l ~restbase100[1-2].*.

Rolling back a deploy

Modify restbase_version in group_vars/restbase-{production,staging} from the currently deployed revision to the revision you'd like to roll back to. Then deploy as usual: ansible-playbook -i production -e target=restbase roles/restbase/deploy.yml

Performing a rolling restart

From the ansible-deploy tree: ansible-playbook -i production -e target=restbase roles/restbase/restart.yml

Substitute the value given for -i as necessary.

Doing Dry Runs

Each of the ansible-playbook commands above can be invoked with the --check and --diff flags to get an indication of what the effect will be, without actually making any changes.
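One way to make the dry run a habit is a small wrapper that always performs it first and asks before applying anything. This is a sketch, not an existing tool: checked_run and the ANSIBLE_PLAYBOOK override variable are hypothetical names; only the --check and --diff flags come from the text above.

```shell
# Command to run; override (e.g. ANSIBLE_PLAYBOOK=echo) to rehearse the flow.
: "${ANSIBLE_PLAYBOOK:=ansible-playbook}"

# Hypothetical wrapper: dry run with --check --diff, then ask before applying.
checked_run() {
  local answer
  "${ANSIBLE_PLAYBOOK}" "$@" --check --diff || return 1
  printf 'Apply these changes for real? [y/N] '
  read -r answer
  if [ "${answer}" != "y" ]; then
    echo "aborted"
    return 1
  fi
  "${ANSIBLE_PLAYBOOK}" "$@"
}

# Usage:
#   checked_run -i staging -e target=restbase roles/restbase/deploy.yml
```

Anything other than a literal "y" aborts, so pressing Enter by accident does not trigger the real run.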

Deploy configuration changes

Because config changes can trigger database changes in RESTBase, it is very important that they are also deployed in a rolling fashion. The configuration templating is handled by puppet, which doesn't directly support rolling deploys. To work around this, we perform the rolling deploy manually: disable puppet on all hosts, then re-enable and run it on one host at a time. Procedure (note: all of the following commands need to be run as root):

  • Disable puppet on all restbase* hosts, to make sure that config changes are applied one host at a time: puppet agent --disable
  • For each node:
    • re-enable / run puppet: puppet agent --enable; puppet agent -tv
    • restart restbase with systemctl restart restbase
    • verify that RB is back up with curl http://<boxip>:7231/

TODO: Integrate with safe rolling restarts above
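Until that integration exists, the procedure above could be sketched as a pair of shell functions. Everything here is illustrative: disable_puppet, rolling_apply and the RUN variable are hypothetical names, logging in as root over ssh is an assumption, and the health check assumes each host name resolves to the box IP that RESTBase listens on. By default the script only prints the commands it would run.

```shell
# RUN defaults to "echo", so commands are printed rather than executed.
# Set RUN= (empty) to really run them.
RUN="${RUN-echo}"

# Step 1: disable puppet on every host, before the config change is merged.
disable_puppet() {
  local host
  for host in "$@"; do
    ${RUN} ssh "root@${host}" 'puppet agent --disable' || return 1
  done
}

# Step 2: apply the change one host at a time, stopping at the first failure.
rolling_apply() {
  local host
  for host in "$@"; do
    echo "== ${host} =="
    # "puppet agent -t" exits 2 when it applied changes, so its exit status
    # is deliberately not treated as a failure here.
    ${RUN} ssh "root@${host}" 'puppet agent --enable; puppet agent -tv'
    ${RUN} ssh "root@${host}" 'systemctl restart restbase' || return 1
    # Retry for a while: the service needs a moment to start listening.
    ${RUN} ssh "root@${host}" \
      "curl --silent --fail --retry 5 --retry-connrefused http://${host}:7231/" || {
      echo "RESTBase did not come back on ${host}; stopping" >&2
      return 1
    }
  done
}

# Usage:
#   disable_puppet restbase1001.eqiad.wmnet restbase1002.eqiad.wmnet
#   (merge the puppet change)
#   rolling_apply restbase1001.eqiad.wmnet restbase1002.eqiad.wmnet
```

Stopping at the first host that fails its check keeps a bad config from spreading past one node, which is the point of rolling the change out.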

What to check after a deploy

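Deploys do not always go according to plan, and regressions are not always obvious. Here is a list of things you should check after each deploy:

  • Does the API documentation (http://en.wikipedia.org/api/rest_v1/?doc) still load? Consider exercising some of the endpoints from the UI (perhaps by requesting an HTML render).
  • Check error logs in logstash.
  • Have a look at the metrics in Grafana. Have latencies increased, or error rates jumped? Is memory utilization consistent with expectations? What about storage (op rates, exceptions, etc.)?
  • Consider making an edit to a page using Visual Editor.
  • Take a look at some recent Visual Editor-performed changes (French Wikipedia works great for this, as they use VE by default). Do the diffs look reasonable?
  • Keep a close eye on #wikimedia-operations; if someone spots a problem, they're likely to raise the issue there.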
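One such check, whether the API documentation at http://en.wikipedia.org/api/rest_v1/?doc still loads, lends itself to a small script. The sketch below is an assumption about how that might be automated, not an existing tool; is_success_status and smoke_check are hypothetical names. It complements the manual checks and does not replace them.

```shell
# Return success only for 2xx HTTP status codes.
is_success_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical smoke test: fetch a URL (following redirects) and report
# whether the final response had a 2xx status.
smoke_check() {
  local url="$1" status
  status="$(curl --silent --location --output /dev/null --max-time 15 \
    --write-out '%{http_code}' "${url}")"
  if is_success_status "${status}"; then
    echo "OK   ${status} ${url}"
  else
    echo "FAIL ${status} ${url}" >&2
    return 1
  fi
}

# Usage:
#   smoke_check 'http://en.wikipedia.org/api/rest_v1/?doc'
```

When curl cannot connect at all, it reports the status as 000, which the helper treats as a failure.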

Other considerations

Be sure to log all actions ahead of time in #wikimedia-operations. Don't be shy about including details.

Debugging

To temporarily switch to local logging for debugging, you can change the config.yaml log stanza like this:

logging:
  name: restbase
  streams:
    # level can be trace, debug, info, warn, error
    - level: info 
      path: /tmp/debug.log

Alternatively, you can log to stdout by commenting out the streams sub-object. This is useful for debugging startup failures like this:

cd /srv/deployment/restbase/deploy/
sudo -u restbase node restbase/server.js -c /etc/restbase/config.yaml -n 0

The -n 0 parameter avoids forking off any workers, which reduces log noise. Instead, a single worker is started up right in the master process.